1.  Introduction


Spatial  awareness  is  the  awareness  of  the  surrounding  space  and  the  location  and  position  of  our 
own  body  within  it.  Thus,  it  is  the  multisensory  awareness  of  being  immersed  in  a  specific  real 
or  virtual  environment.  The  surrounding  environment  may  be  static  or  dynamic. ^  In  a  dynamic 
environment,  changes  in  the  environment  may  result  from  movements  of  surrounding  objects, 
the  observer,  or  both.  Awareness  of  the  dynamic  changes  in  the  environment  may  also  change  as 
a  result  of  the  duration  of  exposure,  a  global  change  in  the  environmental  conditions  (e.g., 
amount  of  lighting),  or  changes  in  the  physiological  or  psychological  status  of  the  observer.  This 
awareness  is  not  an  on-or-off  phenomenon,  and  its  extent  can  be  assessed  by  its  completeness 
and  how  well  it  matches  the  actual  physical  or  virtual  environment.  However,  since  awareness  is 
a  perceptual  phenomenon,  its  correspondence  to  the  physical  or  virtual  environment  is  not 
always  causal  and  must  be  considered  carefully.  The  physical  environment  may  include
misleading  or  confusing  clues,  or  its  synthetic  realization  (virtual  reality)  may  be  flawed.  Certain 
real  properties  of  the  environment  may  not  be  generally  perceived  as  they  truly  are,  e.g.,  vection 
illusion.  Therefore,  the  assessment  of  spatial  awareness  must  take  into  account  both  the  absolute 
physical  reality  and  the  statistical  (perceptual)  reality  based  on  commonality  of  experience. 

Spatial  awareness  resulting  from  auditory  stimulation  is  commonly  referred  to  as  auditory  spatial 
awareness.  Auditory  spatial  awareness  is  the  awareness  of  the  presence,  distribution,  and 
interaction  of  sound  sources  in  the  surrounding  space.  It  is  an  element  of  spatial  awareness  and 
auditory  awareness,  which  also  includes  sound  source  detection  and  acoustic  signal  recognition. 
The  extent  of  auditory  spatial  awareness  in  a  given  environment  depends  on  the  physiological 
status  of  the  listener’s  sense  of  hearing,  their  auditory  experience,  knowledge  of  listening 
strategies,  familiarity  with  the  surrounding  environment,  and  degree  of  involvement  in  the 
listening  activity  (motivation,  attention,  tiredness,  etc.).  It  also  depends  on  the  type  and  extent  of 
protective  headgear  worn  by  the  individual. 

Auditory  spatial  awareness  is  a  three-dimensional  (3-D)  ability;  hearing  is  the  only  directional 
human  telereceptor  that  operates  in  a  full  360°  range  and  is  equally  effective  in  darkness  as  in 
bright  light.  Thus,  the  auditory  system  is  frequently  a  guiding  system  for  vision  in  determining 
the  exact  location  and  visual  properties  of  a  given  object.  Simple  reaction  time  (SRT)  to 
auditory  stimuli  is  also  shorter  than  that  to  other  sensory  stimuli  (e.g.,  visual  stimuli).  Auditory 
SRTs  are  typically  on  the  order  of  100-160  ms,  whereas  visual  SRTs  are  in  the  200-250  ms 
range  (Carterette,  1989,  p.  91).  Similarly,  Welch  and  Warren  (1986)  listed  auditory  SRTs  as 


^Ericson  et  al.  (1991)  reported  choice  reaction  time  (CRT)  for  an  auditory  localization  task  on  the  order  of  3.0-3.5  s
(broadband  stimuli  arriving  from  any  spherical  angle).  Slightly  longer  times  of  4.0-4.5  s  were  reported  by  both  Ericson  et  al. 
(1991)  and  Endsley  and  Rosiles  (1995)  for  auditory  virtual  reality  scenarios.  Noble  and  Gates  (1985)  observed  that  the  use  of 
hearing  protectors  increased  localization  CRT  of  their  subjects  from  3.0  s  to  5.0  s. 
30-40  ms  shorter  than  visual  SRTs.  In  fact,  superior  temporal  discrimination  is  the  main  asset  of
the  auditory  sense  (Kramer,  1994),  and  the  human  ability  to  discern  short-term  changes  in 
arriving  sound  makes  auditory  spatial  perception  an  important  means  for  detecting  early  warning 
signs. 

Auditory  spatial  awareness  results  from  human  abilities  to  identify  the  direction  from  which  a
sound  is  coming,  estimate  the  distance  to  the  sound  source,  and  assess  the  size  and
character  of  the  surrounding  physical  space  affecting  sound  propagation.  It  also  includes  the 
awareness  of  the  presence  of  ambient  sounds  whose  physical  sources  cannot  be  localized.  The 
first  three  elements  of  auditory  spatial  awareness  are  commonly  referred  to  in  the  psychoacoustic 
literature  as  the  acts  of  auditory  localization,  auditory  distance  estimation,  and  auditory 
spaciousness  assessment. 

The  above  concept  of  auditory  spatial  awareness  separates  the  judgment  of  auditory  distance 
from  the  act  of  auditory  localization.  This  concept  differs  from  the  concept  of  localization 
expressed  in  the  general  literature,  where  localization  is  defined  as  the  act,  process,  or  ability  of 
identifying  the  physical  location  of  an  object — or  the  origin  of  a  given  activity — in  space  (e.g., 
APA,  2007;  Houghton  Mifflin,  2007).  In  the  case  of  Euclidean  space  with  polar  coordinates,  this 
location  is  specified  by  its  azimuth,  elevation,  and  distance.  Therefore,  the  general  definition  of 
localization  treats  distance  estimation  as  one  of  the  elements  of  localization.  However,  it  does 
not  mean  that  this  broad  concept  of  localization  has  to  be  strictly  followed  if  a  different,  narrow 
concept  of  localization  is  more  operationally  useful.  Such  a  narrow  interpretation  of  localization 
is  frequently  adopted  in  the  psychoacoustic  literature  where  auditory  localization  is  defined  as 
the  act  of  identifying  the  direction  toward  the  spatial  location  of  the  sound  source  (e.g.,  Illusion,
2010;  Morfey,  2001;  White,  1987).  In  these  definitions,  the  distance  to  the  sound  source  is  not 
mentioned  and  its  judgment  is  treated  as  a  separate  entity. 
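
The difference between the broad and narrow meanings of localization can be illustrated with a minimal sketch. The coordinate convention used here (azimuth measured in the horizontal plane from straight ahead, elevation measured from the horizontal plane, x pointing forward, y to the left, z up) and the function names are assumptions made for illustration only; the conventions actually used in localization research are discussed in section 4.

```python
import math

def source_position(azimuth_deg, elevation_deg, distance_m):
    """Broad localization: full position from azimuth, elevation, and distance.

    Assumed convention: x forward, y left, z up; angles in degrees.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    x = distance_m * math.cos(el) * math.cos(az)
    y = distance_m * math.cos(el) * math.sin(az)
    z = distance_m * math.sin(el)
    return (x, y, z)

def source_direction(azimuth_deg, elevation_deg):
    """Narrow localization: direction only, returned as a unit vector."""
    return source_position(azimuth_deg, elevation_deg, 1.0)

# Two hypothetical sources in the same direction but at different distances.
near = source_position(30.0, 10.0, 2.0)
far = source_position(30.0, 10.0, 20.0)
direction = source_direction(30.0, 10.0)
```

In this sketch the two sources differ in position but share one direction vector, which is the only quantity addressed by the narrow (directional) meaning of localization.
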

To  avoid  potential  confusion  between  the  broad  and  narrow  meanings  of  the  term  localization, 
some  authors  (e.g.,  Dietz  et  al.,  2011;  Viste  and  Evangelista,  2003)  use  the  term  direction  of
arrival  (DOA),  a  technical  term  borrowed  from  the  fields  of  radar  and  sonar  (Mathews  and
Zoltowski,  1994),  to  denote  directional  localization  and  distinguish  it  from  general  localization.
Following  this  concept,  the  use  of  the  term  auditory  localization  would  be  restricted  to  its  broad
meaning.  Although  such  an  approach  has  some  merit  from  the  formal  point  of  view,  the  term 
DOA  is  not  normally  used  in  reference  to  humans  and  may,  in  effect,  create  more  rather  than  less 
confusion  since  the  use  of  the  narrow  meaning  of  localization  is  widespread  in  the 
psychoacoustic  literature.  Therefore,  this  report  follows  the  narrow  interpretation,  and  the  term
auditory  localization  is  used  here  to  refer  solely  to  directional  judgments.

1.1  Auditory  Localization 

Auditory  localization  is  the  element  of  auditory  spatial  perception  that  is  the  most  critical  to 
human  effectiveness  and  personal  safety.  The  sound  of  a  weapon,  vehicle,  or  an  approaching 
person  can  usually  be  heard  much  earlier  than  the  source  of  the  sound  can  be  seen.  Knowing
where  to  listen  improves  situational  awareness,  speech  perception,  and  sound  source 
identification  in  the  presence  of  other  sound  sources  (e.g.,  Bronkhorst,  2000;  Kidd  et  al.,  2005). 
For  these  reasons,  studies  of  human  auditory  localization  performance  and  the  localization  errors 
made  under  various  listening  conditions  are  ongoing  research  programs  in  many  military 
acoustic  laboratories. 

As  mentioned  earlier,  auditory  localization  is  a  3-D  ability,  but  it  is  normally  discussed  in  the 
literature  as  a  combination  of  two  separate  judgments:  a  horizontal  localization  judgment  and  a 
vertical  localization  judgment.  The  separate  focus  on  the  horizontal  and  vertical  judgments 
simplifies  the  discussion  of  the  effects  of  the  underlying  localization  cues.  However,  a  number 
of  cue-oriented  and  3-D  localization  studies  (as  opposed  to  localization  studies  limited  to  one 
specific  plane)  have  demonstrated  that  horizontal  and  vertical  judgments  are  not  fully 
independent  and  that  they  both  depend  on  the  actual  location  of  the  sound  source  in  both 
directions. 

Localization  judgments  can  range  from  simple  left-right,  up-down,  or  more-less  discrimination 
to  categorical  judgments  to  absolute  identifications  of  specific  directions  in  space.  Two  excellent 
sources  of  information  on  auditory  localization  are  Blauert’s  (1974/2001)  and  Yost  and 
Gourevitch’s  (1987)  books  on  spatial  hearing.  Both  books  provide  a  wealth  of  information  on 
the  effects  of  signal  and  listening  environment  properties  on  monaural  and  binaural  localization 
accuracy  under  various  listening  conditions.  However,  they  only  marginally  address  auditory 
localization  metrics^  and  measurement  methodologies.  This  same  methodological  limitation  is 
true  of  most  other  psychoacoustic  textbooks.  Yet,  the  proper  understanding  of  metrics  and  data 
collection  methods  is  very  important  for  both  the  collection  and  interpretation  of  auditory 
localization  data  since  localization  errors  can  be  defined  and  measured  in  a  variety  of  ways. 

Thus,  the  focus  of  this  report  is  on  localization  metrics  and  data  collection  methodologies. 

Localization  judgments  refer,  in  general,  to  the  locations  of  sound  sources  in  surrounding  space; 
however,  in  some  cases,  the  listeners  may  feel  that  the  sound  sources  are  located  inside  their 
head.  Such  in-the-head  imaginary  (phantom)  sound  sources^  are  commonly  perceived  when 
sound  is  presented  through  earphones  without  pre-processing  it  using  head-related  transfer 
functions  (HRTF)  (see  section  2.2).  In  addition,  such  sensations  may  exist  under  some  open-ear 
conditions  (e.g.,  Griesinger,  1998;  Minnaar,  2010).  For  example,  a  sound  source  may  be


^In  discussing  localization  metrics,  it  is  important  to  differentiate  between  the  concepts  of  measure  and  metric.  Both  terms 
have  several  dictionary  definitions  and  there  is  a  certain  degree  of  overlap  between  their  meanings.  In  general,  a  measure  is  an 
objective  amount  of  an  attribute  that  is  quantified  against  a  certain  standard.  It  is  the  extent  or  degree  of  something  (e.g.,  a 
measure  of  distance  or  measure  of  central  tendency)  or  a  unit  of  measurement  (e.g.,  a  kilometer  or  standard  deviation).  A  metric 
is  a  measure  applied  to  a  specific  task.  It  is  the  degree  to  which  a  particular  subject  possesses  the  quality  that  is  being  measured. 
For  example,  a  kilometer  is  a  measure  of  distance.  However,  when  the  kilometer  is  used  to  determine  how  far  a  car  can  travel  on 
a  single  tank  of  gas,  it  becomes  a  metric.  In  the  context  of  this  report,  standard  deviation  is  a  general  measure,  but  standard 
deviation  used  to  quantify  the  localization  error  is  a  metric  of  this  error. 

^An  imaginary  (phantom)  sound  source  is  the  perceptual  image  of  a  real  sound  source  that  does  not  coincide  spatially  with  its 
true  location. 
perceived  as  located  in-the-head  during  listening  to  a  single  sound  source  or  several  sound
sources  set  at  equal  distances  from  the  listener  in  an  anechoic  chamber  (Toole,  1970).  Similarly, 
in-the-head  location  of  a  phantom  sound  source  may  take  place  when  each  ear  is  stimulated  by  a 
separately  generated  sound  (Plenge,  1974).  The  in-the-head  sound  source  may  appear  to  occupy 
the  whole  head  or  it  may  be  perceived  as  a  more  discrete  object  located  somewhere  along  an 
imaginary  internal  arc  connecting  the  left  and  right  ear.  If  the  in-the-head  sound  source  is 
perceived  as  being  located  closer  to  one  of  the  ears  of  the  listener,  it  is  said  to  be  lateralized 
toward  this  ear.  Consequently,  the  terms  lateralization  and  localization  are  used  in 
psychoacoustic  literature  to  describe  judgments  of  the  in-the-head  and  out-of-the-head  location 
of  a  perceived  sound  source  (Emanuel  and  Letowski,  2009;  Howard  and  Rogers,  2012;  Yost  and 
Hafter,  1987).  These  terms  are  used  regardless  of  whether  the  real  sound  source  is  located 
outside  of  the  head  or  the  sound  is  provided  by  earphones  or  a  bone  conduction  system. 

1.2  Auditory  Distance  Estimation 

Auditory  distance  estimation  is  the  judgment  of  the  distance  from  the  listener  to  the  sound 
source.  This  judgment  may  take  the  form  of  a  simple  discrimination  judgment  (closer-farther),  a 
sequential  ratio  judgment  (half  as  far,  twice  as  far),  or  an  absolute  judgment  in  some  unit  of 
distance.  In  order  for  this  judgment  to  have  real  auditory  meaning,  the  sound  source  has  to  be 
invisible.  In  the  case  of  two  sound  sources  concurrently  emitting  the  sound  and  located  at 
different  distances  from  the  listener,  the  listener  may  estimate  the  relative  difference  in  distance 
between  the  two  sources  using  the  same  types  of  judgments.  Such  relative  judgments  are 
referred  to  as  auditory  distance  difference  or  auditory  depth  judgments.  A  good  summary  of  the 
basic  issues  related  to  auditory  distance  perception  can  be  found  in  Grantham  (1995). 

1.3  Spaciousness  Perception 

The  third  element  of  auditory  spatial  awareness,  spaciousness,  is  the  perception  of  being 
surrounded  by  sound  and  is  related  to  the  type  and  size  of  the  surrounding  space.  It  depends  not 
only  on  the  type  and  volume  of  the  space  but  also  on  the  number,  type,  and  locations  of  the 
sound  sources  in  the  space.  Unlike  horizontal  and  vertical  localization  and  distance  estimation 
judgments,  which  are  made  along  a  single  continuum,  spaciousness  is  a  multidimensional 
phenomenon  that  does  not  yet  have  a  set  of  well-established  dimensions  and  is  usually  described 
in  relative  terms  or  using  categorical  judgments.  Issues  related  to  auditory  spaciousness  are 
covered  in  books  on  concert  hall  acoustics,  music,  and  audio  recording  technologies  (e.g.,  Rasch 
and  Plomp,  1999).

1.4  Goals,  Format,  and  Structure  of  this  Report 

This  report  is  intended  to  provide  a  common  terminological  and  methodological  platform  for 
information  exchange  between  laboratories  investigating  auditory  localization  and  summarize 
the  state-of-the-art  knowledge  about  localization  metrics  and  human  localization  ability.  It  is 
structured  so  as  to  first  describe  the  general  concepts  related  to  spatial  auditory  awareness  and 
sensory  mapping  of  the  acoustic  environment  and  then  to  use  them  as  a  backdrop  for  a  more
detailed  discussion  of  the  issues  related  to  the  planning,  execution,  and  analysis  of  auditory 
localization  studies.

The  initial  part  of  the  report  (sections  2-4)  is  concerned  with  the  formal  and  physiological  bases 
of  auditory  localization.  Section  2  starts  with  a  discussion  of  various  localization  cues  and  their 
contributions  to  the  general  localization  ability  of  a  listener  and  is  followed  by  a  review  of  the 
effects  of  age,  gender,  and  hearing  loss  on  localization  performance.  Section  3  is  an  overview  of 
the  neurophysiology  of  spatial  localization.  Although  this  section  is  not  directly  related  to  the 
main  purpose  of  the  report,  it  is  important  for  understanding  ear  pathologies  mentioned  in 
section  12  and  outlines  the  processing  of  spatial  information  by  the  nervous  system  leading  to  the 
build-up  of  auditory  spatial  awareness. 

The  diversity  of  terms  and  points  of  reference  used  in  auditory  localization  publications  together 
with  inconsistent  semantics  has  been  the  source  of  some  confusion  in  data  interpretation. 
Therefore,  section  4  presents  the  basic  terminology  used  in  spatial  research  with  an  emphasis  on 
the  various  systems  of  coordinates  used  to  describe  the  data.  Further,  in  order  to  meaningfully 
interpret  the  character  of  overall  localization  error,  it  is  important  to  determine  both  the  constant 
error  (accuracy)  and  random  error  (precision)  components  of  localization  judgments.  Overall 
error  metrics  like  root  mean  squared  error  and  mean  unsigned  error  represent  a  specific 
combination  of  these  two  error  components  and  do  not  on  their  own  provide  an  adequate 
characterization  of  localization  error.  Overall  localization  error  can  be  used  to  characterize  a 
given  set  of  results  but  does  not  give  any  insight  into  the  underlying  causes  of  the  error.  All 
these  issues  are  discussed  in  section  5,  which  includes  a  discussion  of  some  elements  of 
measurement  theory  and  error  metrology. 
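
The relationship between the overall error and its two components can be illustrated with a short sketch. It assumes that the constant error is taken as the mean signed error and the random error as the population standard deviation of the signed errors; the function name and the sample data are hypothetical, and the errors are treated as linear quantities (no wrap-around at 360°). The definitions adopted in this report are those given in section 5.

```python
import math

def error_components(responses_deg, target_deg):
    """Return (constant error, random error, RMSE, mean unsigned error)."""
    errors = [r - target_deg for r in responses_deg]
    n = len(errors)
    constant_error = sum(errors) / n  # signed bias (accuracy)
    random_error = math.sqrt(
        sum((e - constant_error) ** 2 for e in errors) / n
    )  # spread around the bias (precision), population SD
    rmse = math.sqrt(sum(e ** 2 for e in errors) / n)
    mue = sum(abs(e) for e in errors) / n
    return constant_error, random_error, rmse, mue

# Hypothetical responses (in degrees) to a source located at 30 degrees.
ce, re, rmse, mue = error_components([28, 32, 35, 33, 32], 30)

# With these definitions, RMSE^2 = CE^2 + RE^2, so the RMSE alone cannot
# show how much of the error is bias and how much is scatter.
identity_gap = abs(rmse ** 2 - (ce ** 2 + re ** 2))
```

Under these assumptions, two data sets with the same root mean squared error may have very different constant and random components, which is why both need to be reported.
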

The  main  part  of  the  report  (sections  6-7)  is  devoted  to  the  introduction  of  various  localization 
metrics  and  circular  data  analysis.  Common  linear  metrics  used  to  describe  directional  data, 
along  with  some  more  advanced  metrics,  are  explained  and  compared,  and  their  advantages  and 
limitations  outlined.  However,  the  fundamental  property  of  localization  data  is  that  they  are  by 
their  nature  angular  and  thus  constitute  circular  (spherical)  variables.  Such  data,  in  general, 
cannot  be  described  by  a  linear  distribution  as  assumed  in  classical  statistics.  The  azimuth  and 
elevation  of  sound  source  locations  define  an  ambiguous  conceptual  sphere,  which  can  only  be 
fully  analyzed  with  the  methods  of  spherical  statistics.  The  appropriate  methods  of  statistical 
analysis  for  such  two-dimensional  (2-D)  (circular)  and  3-D  (spherical)  data  are,  respectively,  the 
tools  of  circular  and  spherical  statistics.  However,  if  a  set  of  directional  judgments  is  relatively
concentrated  around  a  central  direction,  the  differences  between  the  circular  and  linear  metrics 
may  be  minimal,  and  linear  statistics  may  effectively  be  used  in  lieu  of  circular  statistics.  The 
conditions  under  which  the  linear  analysis  of  directional  data  is  justified  are  outlined  in  section  7 
on  circular  data  analysis. 
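
The difference between linear and circular treatment of directional data can be sketched as follows. The example uses the standard circular mean (the direction of the summed unit vectors); the function names and the sample azimuths are illustrative assumptions and are not taken from the studies reviewed in this report.

```python
import math

def linear_mean_deg(angles_deg):
    """Arithmetic mean, which ignores the wrap-around at 0/360 degrees."""
    return sum(angles_deg) / len(angles_deg)

def circular_mean_deg(angles_deg):
    """Mean direction of unit vectors, returned in the range [0, 360)."""
    s = sum(math.sin(math.radians(a)) for a in angles_deg)
    c = sum(math.cos(math.radians(a)) for a in angles_deg)
    return math.degrees(math.atan2(s, c)) % 360.0

# Judgments scattered around straight ahead: the arithmetic mean points
# far to the side (about 126.7), the circular mean stays near the front
# (about 6.7).
around_front = [350.0, 10.0, 20.0]
front_linear = linear_mean_deg(around_front)
front_circular = circular_mean_deg(around_front)

# Judgments concentrated away from the wrap-around point: the two means
# practically coincide.
concentrated = [20.0, 25.0, 30.0, 35.0]
conc_linear = linear_mean_deg(concentrated)
conc_circular = circular_mean_deg(concentrated)
```

The second data set illustrates the case in which linear statistics may reasonably stand in for circular statistics.
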

The  subsequent  part  of  the  report  (sections  8-12)  provides  a  discussion  of  the  various  types  of
localization  tasks,  localization  reversal  errors,  and  attempts  to  use  auditory  localization  tasks  in 
clinical  audiology.  The  discussion  is  supported  by  results  of  various  research  studies  in  order  to 
provide  the  reader  with  state-of-the-art  reference  data.  Although  the  focus  of  the  discussions 
conducted  in  sections  8-12  is  on  auditory  localization  in  the  sound  field  with  unoccluded  ears, 
some  data  for  the  earphone-based  auditory  virtual  reality  (AVR)  environments  are  also  provided. 
However,  the  accuracy  of  specific  spatial  renderings  implemented  in  various  AVR  studies  may 
vary  and  affect  localization  data  (e.g.,  Bronkhorst,  1995;  Martin  et  al.,  2001;  Wightman  and
Kistler,  1989b).  There  are  also  important  differences  in  the  stimuli  used  in  such  studies  that  may 
affect  localization  error.  Therefore,  the  data  reported  in  such  studies  need  to  be  treated  with 
caution. 

The  final  part  of  the  report  includes  a  review  of  complex  localization  scenarios  involving 
multiple  and  moving  sound  sources  (sections  13-14),  a  short  summary  (section  15),  and  two 
methodological  appendices  focused  on  the  effects  of  directional  response  (appendix  A)  and 
listener  learning/practice  (appendix  B)  on  the  results  of  localization  studies.  The  preferred  type 
of  directional  response  and  listener  learning/practice  effects  are  the  two  most  debated  elements 
of  localization  study  methodology.  Therefore,  these  two  appendices  are  intended  to  provide 
background  information  on  both  issues  for  readers  designing  their  own  localization  studies.  An 
extensive  list  of  references  mentioned  in  the  report  is  provided  in  section  16. 


2.  Basis  of  Auditory  Localization 


The  human  auditory  localization  ability  depends  on  a  number  of  anatomical  and  physiological 
properties  of  the  auditory  system  as  well  as  on  a  number  of  behavioral  factors.  These  properties 
and  behaviors  are  referred  to  in  the  literature  as  localization  cues.  These  cues  are  generally 
classified  as  binaural,  monaural,  dynamic,  and  vision  and  memory  cues.  The  most  important  of 
these  cues  are  the  binaural  cues  that  are  related  to  the  presence  of  two  external  ears  located  on 
opposite  sides  of  the  head^  and  serving  as  the  entry  points  to  the  auditory  system.  This 
configuration  causes  a  sound  coming  at  the  listener  from  an  angle  to  have  a  different  sound 
intensity  and  time  of  arrival  at  each  ear.  Moreover,  individual  anatomic  differences  in  the  size 
and  shape  of  both  the  head  and  external  ears  of  the  listener  affect  the  perceived  direction  of 
incoming  sound  by  creating  a  characteristic  pattern  of  the  directional  properties  of  the  human
head  (HRTF,  see  section  2.2)  that  uniquely  modifies  the  spectrum  of  incoming  sound  for  each
person  (Watanabe  et  al.,  2007).  In  addition  to  the  above  anatomical  cues  and  the  slight  natural
asymmetry  of  ear  placement  on  the  head  (King,  1999;  Knudsen,  1984),  the  listener’s  movements, 
familiarity  with  the  sound  source,  visibility  of  a  potential  sound  source,  and  expectations  may 


^Typically,  the  human  ears  are  not  located  at  either  end  of  a  diameter  of  the  head  but  are  set  back  by  about  10°  from  the
coronal  plane  (Blauert,  1974/2001). 
affect  the  perceived  direction  of  an  incoming  sound  (e.g.,  Haas,  1951;  Jongkees  and  Groen,
1946;  Wallach  et  al.,  1949).

2.1  Binaural  Cues 

2.1.1  History 

The  first  widely  known  study  in  auditory  localization  was  carried  out  by  Venturi  (1796)  who 
walked  around  a  listener  playing  a  note  on  a  flute  at  intervals  and  demonstrated  that  people  could 
point  to  the  direction  from  which  the  sound  of  the  flute  was  coming.  He  attributed  this  capability 
to  sound  intensity  differences  at  each  of  the  two  ears  of  the  listener.  However,  despite  being 
published  in  three  languages,  his  work  did  not  generate  much  interest  among  his  contemporaries. 
Very  little  research  was  done  in  the  area  of  auditory  localization  until  the  last  quarter  of  the  19th 
century,  when  several  authors  experimentally  confirmed  the  importance  of  sound  intensity 
differences  between  the  ears  for  sound  source  localization  (Steinhauser,  1879;  Strutt  [Lord 
Rayleigh],  1876;  Thompson,  1877;  1881).  This  difference  is  caused  by  the  acoustic  shadow  and 
baffle  effects  of  the  head  and  results  in  a  lower  sound  intensity  at  the  ear  located  farther  away 
from  the  sound  source.  However,  the  difference  is  practically  negligible  for  low  frequency 
sounds  below  200  Hz,  and  the  fact  that  these  sounds  can  still  be  localized  baffled  initial 
researchers  studying  auditory  localization. 

Thompson  (1878)  seems  to  have  been  the  first  to  suggest  that  low  frequency  sound  sources  can 
be  localized  on  the  basis  of  sound  phase  differences  between  the  ears.  However,  his  suggestion 
was  rejected  by  his  contemporaries  due  to  the  then  (1863)  popular  theory  that  people  are  “phase 
deaf.”  It  was  not  until  1907,  when  Lord  Rayleigh  experimentally  showed  that  the  direction 
toward  a  low  frequency  sound  source  could  be  determined  on  the  basis  of  the  phase  difference 
between  the  sounds  arriving  at  the  two  ears  that  the  phase  difference  mechanism  of  sound 
localization  was  generally  accepted  (Strutt  [Lord  Rayleigh],  1907).  This  difference  is  caused  by 
the  different  distances  the  sound  has  to  travel  to  each  of  the  ears  and,  in  the  case  of  periodic 
sounds,  can  be  expressed  as  phase  difference. 

Phase  difference  can  also  be  expressed  as  time  difference.  Time  difference  has  a  more  general 
meaning  because  it  can  also  be  applied  to  impulse  and  other  non-periodic  signals.  The  first 
suggestion  that  the  position  of  a  sound  source  can  be  localized  on  the  basis  of  the  difference  in 
the  time  of  arrival  of  the  sound  wave  to  the  two  ears  was  made  by  Mallock  (1908)  and  shortly 
later  corroborated  by  Aggazzotti  (1911),  Hornbostel  and  Wertheimer  (1920),  and  Klemm  (1920).
The  above  phase/time  localization  mechanism  has  been  shown  to  work  well  at  low  frequencies, 
but  for  sounds  at  frequencies  exceeding  about  1.2  kHz  (Middlebrooks  and  Green,  1991),  the 
wavelengths  become  shorter  than  the  distance  between  the  ears  of  the  listener  and  phase 
differences  become  an  ambiguous  cue  (Hartley,  1919;  More  and  Fry,  1907;  Strutt  [Lord 
Rayleigh],  1907;  Wilson  and  Myers,  1908).  This  observation  prompted  Strutt  to  propose  the 
duplex  theory  of  localization,  in  which  phase  differences  and  intensity  differences  are  two 
complementary  localization  mechanisms  allowing  humans  to  localize  low  and  high  frequency 
sound  sources,  respectively  (Strutt  [Lord  Rayleigh],  1907).  This  theory  was  later  developed  by
his  followers  (Stevens  and  Newman,  1936).  It  is  important  to  stress  that  although  directional 
perception  has  been  accepted  as  a  3-D  phenomenon  since  the  time  of  Venturi,  most  of  the  early 
research  and  subsequent  theories  of  auditory  localization  have  exclusively  focused  on 
localization  in  the  horizontal  plane. 

2.1.2  Duplex  Theory 

The  two  auditory  mechanisms  comprising  the  duplex  theory  of  localization  are  commonly 
referred  to  in  the  modern  literature  as  the  interaural  intensity  difference  (IID)  and  the  interaural 
time  difference  (ITD)  mechanisms.  In  the  case  of  continuous  pure  tones  and  harmonic 
complexes,  the  term  interaural  phase  difference  (IPD)  is  used  in  place  of  ITD  since  such  sounds 
have  no  clear  reference  point  in  time.  The  IID  and  ITD  (IPD)  together  are  called  the  binaural^
localization  cues.  As  discussed,  the  IID  is  the  dominant  localization  cue  for  high  frequency 
sounds,  while  the  ITD  (IPD)  is  the  dominant  cue  for  low  frequency  sounds  (waveform  phase 
difference).  However,  it  was  later  discovered  that  the  ITD  (IPD)  is  also  an  important  cue  in  the 
localization  of  high  frequency  sounds  whose  temporal  envelopes  have  different  onsets  at  the  left 
and  right  ear  (Henning,  1974;  1980;  McFadden  and  Pasanen,  1976;  Zhang  and  Wright,  2007). 
The  resulting  localization  cue  is  frequently  referred  to  as  the  interaural  envelope  difference 
(IED).  In  a  similar  fashion,  the  IID  cues  have  been  found  to  be  important  for  the  localization  of
low  frequency  sounds  in  the  case  of  near-field  sound  sources  (Brungart  and  Rabinowitz,  1999;
Brungart  et  al.,  1999;  Shinn-Cunningham  et  al.,  2000).  See  section  2.1.5  for  more  information
on  the  differences  in  the  localization  of  far-  and  near-field  sound  sources.

The  transition  zone  between  low  and  high  frequency  binaural  mechanisms  extends 
approximately  from  800  to  1600  Hz.  In  this  region  localization  performance  is  the  poorest 
(Stevens  and  Newman,  1936;  Sandel  et  al.,  1955).  As  regards  the  low  and  high  frequency
regions,  Langford  (1994)  reported  that  people  who  discriminate  the  low  frequency  ITD  cues  well
also  discriminate  the  high  frequency  IID  cues  well,  although  individual  differences  are  large.  The
mechanisms  of  both  binaural  cues  (IID  and  ITD)  are  shown  in  figure  1. 


^The  term  binaural  was  most  likely  first  used  by  Alison  (1861)  who  used  this  term  to  describe  his  differential  stethophone  and 
later  by  Thompson  (1878)  to  describe  two-ear  phenomena. 

Figure  1.  Binaural  localization  cues  caused  by  the  differences  in  sound  intensity  (stronger  signal  and
weaker  signal)  and  sound  time  of  arrival  (near  and  far  ear).  Adapted  from  Shaw  (1974).

2.1.3  Interaural  Time  Difference  (ITD) 

The  interaural  time  difference  (ITD)  is  the  dominant  binaural  cue  for  humans  since  it  is  a  major  cue  for  low
frequency  sound  source  localization  as  well  as  an  important  secondary  cue  for  high  frequency 
sound  source  localization  (Macpherson  and  Middlebrooks,  2002).  The  ITD  resulting  from  a 
plane  sound  wave  arriving  at  the  near  and  far  ear  of  the  listener  can  be  approximately  calculated 
on  the  basis  of  a  frequency-independent  model  of  a  wave  traveling  around  a  sphere  as 
(Woodworth  and  Schlosberg,  1954) 


$$\mathrm{ITD} = \frac{r}{c}\,(\theta + \sin\theta) , \tag{1}$$

where  r  is  the  approximate  radius  of  the  listener’s  head,  θ  is  the  angle  between  the  listener’s
medial  axis  and  the  direction  toward  the  sound  source  (see  figure  1),  and  c  is  the  speed  of  sound.

For  angles  θ  <  45°,  θ  ≈  sin  θ  (underestimation  error  less  than  5%)  and  equation  1  can  be
rewritten  as 


$$\mathrm{ITD} = \frac{2r}{c}\,\sin\theta . \tag{2}$$

However,  the  above  frequency-independent  model  of  a  wave  traveling  around  a  sphere  is  only  a 
good  model  of  ITD  at  high  frequencies  (above  3000  Hz),  whereas  at  low  frequencies  the 
diffraction  of  sound  waves  around  the  human  head  causes  longer  ITD.  In  general,  the  ITD  can 
be  calculated  from  the  following  formula  (Kuhn,  1977) 

$$\mathrm{ITD} = \frac{a\,r}{c}\,\sin\theta, \qquad (3)$$

where a = 3 for frequencies below about 500 Hz (θ < 90°) and gradually decreases with frequency
to a = 2 for frequencies above 2000 Hz (θ < 60°)^. In addition, ITD decreases slightly with
temperature since the speed of sound c = 331 + 0.6T (in m/s), where T is the ambient temperature in °C.

For example, for f < 500 Hz, θ = 90°, r = 9.0 cm (Bushby et al., 1992), and T = 15 °C, the ITD is
794 μs. This is the greatest possible ITD, also called the critical ITD, for a listener with a head
radius  of  9  cm  listening  under  the  stated  conditions.  The  critical  ITD  value  and  the  angular  range 
in  which  ITD  can  be  used  as  a  localization  cue  increase  with  increasing  head  size  but  decrease 
with  increasing  frequency  and  temperature.  Heffner  (2004)  argued  that  the  larger  the  head  size 
the  more  robust  the  binaural  cues  are,  since  in  addition  to  increasing  the  critical  ITD  value  and 
the  angular  range,  a  larger  head  creates  a  greater  acoustic  shadow,  which  in  turn  allows  for  larger 
IIDs. 
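
The ITD formulas above can be checked numerically. The following Python sketch is only an illustration: the function names are hypothetical, the default head radius (9 cm) and temperature (15 °C) are taken from the example above, and the value of a must be supplied by the caller because equation 3 does not specify how a varies between 500 and 2000 Hz.

```python
import math


def speed_of_sound(temp_c):
    """Speed of sound in m/s from the linear approximation c = 331 + 0.6*T."""
    return 331.0 + 0.6 * temp_c


def itd_sphere(theta_deg, r=0.09, temp_c=15.0):
    """Equation 1: ITD = (r/c) * (theta + sin(theta)), with theta in radians."""
    theta = math.radians(theta_deg)
    return r / speed_of_sound(temp_c) * (theta + math.sin(theta))


def itd_general(theta_deg, a, r=0.09, temp_c=15.0):
    """Equation 3: ITD = (a*r/c) * sin(theta); a = 2 reduces to equation 2."""
    theta = math.radians(theta_deg)
    return a * r / speed_of_sound(temp_c) * math.sin(theta)


# Worked example from the text: f < 500 Hz (a = 3), theta = 90 deg, r = 9 cm, T = 15 C.
critical_itd = itd_general(90.0, a=3)
print(f"critical ITD: {critical_itd * 1e6:.0f} us")  # about 794 us

# Same geometry at a warmer temperature: c is larger, so the ITD is slightly smaller.
print(f"at 30 C:      {itd_general(90.0, a=3, temp_c=30.0) * 1e6:.0f} us")
```

Under these assumptions the sketch reproduces the critical ITD of about 794 μs quoted above and shows the small decrease of ITD with increasing temperature.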

In  the  context  of  low  frequency  sound  localization,  it  should  be  noted  that  Savel  (2009)  studied 
the  horizontal  localization  ability  of  50  adult  listeners  using  low-frequency  bands  of  noise  and 
observed  a  frequent  left-hemisphere  advantage  in  localization  accuracy  and  precision  (see 
section  5)  for  right-handed  (vs.  left-handed)  and  male  (vs.  female)  listeners.  She  inferred  that 
this  asymmetry  may  be  related  to  differences  in  brain  organization  and  temporal  processing 
between  the  respective  groups. 

2.1.4  Interaural  Intensity  Difference  (IID) 

It  is  generally  assumed  that  the  diffraction  effect  of  an  average  human  head  becomes  negligible 
below  1  kHz  and  that  at  frequencies  below  1.5  kHz,  the  IID  is  too  small  to  facilitate  sound 
localization.  In  contrast,  the  IID  reaches  10-35  dB  for  high  frequency  sounds  (e.g.,  10  dB  at 
3  kHz  and  35  dB  at  10  kHz)  depending  on  the  lateral  position  of  the  sound  source  and  the  sound 
frequency (Feddersen et al., 1957; Kuhn, 1977; 1987; Mills, 1958; Middlebrooks and Green,
1991; Middlebrooks et al., 1989). Also, the IID effect across the middle and high frequency
region  has  a  net  effect  of  an  8  dB  improvement  in  signal-to-noise  ratio  when  the  target  sound 
source  and  the  masking  sound  source  are  located  at  opposite  sides  of  the  head  (e.g.,  Bronkhorst, 
2000).  The  general  relationship  between  the  maximum  ITD  and  IID  and  sound  frequency  is 
shown  in  figure  2. 


^The  decrease  in  the  value  of  a  is  nearly  monotonic  except  for  a  small  drop  to  about  a=1.7  over  the  1400-1600  Hz  frequency 
range.  This  minimum  a  value  (and  the  corresponding  ITD  value)  occurs  precisely  over  the  same  frequency  range  as  where 
listeners  exhibit  the  poorest  localization  discrimination  (Mills,  1958). 

Figure 2. General dependence of ITD (dashed line, left scale) and IID (solid line, right scale) on
frequency (horizontal axis, in Hz). Adapted from Gulick et al. (1989).

The  general  monotonic  relationship  between  the  IID  and  the  azimuth  angle  is  only  the  first 
approximation  of  the  actual  relationship.  Due  to  the  physics  of  wave  diffraction  around  the  head 
(Kuhn,  1977;  1987),  the  maximum  IID  appears  not  at  90°  but  at  a  smaller  angle,  making  the 
relationship  between  the  IID  and  azimuth  angle  non-monotonic.  However,  the  higher  the 
frequency,  the  higher  the  IID  and  the  larger  the  angle  at  which  the  IID  reaches  its  maximum 
(Macaulay  et  al.,  2010).  Thus  as  frequency  increases,  the  angle  of  maximum  IID  approaches  90° 
and  the  non-monotonicity  is  gradually  reduced.  The  non-monotonic  behavior  of  the  IID  does 
cause  large  localization  uncertainty  for  mid-high  frequency  tones  (1000-1600  Hz)  that  arrive 
from locations more than 30-40° off the midline (Firestone, 1930; Macaulay et al., 2010; Mills,
1958;  Nordlund,  1962ab). 

2.1.5  Far-field  and  Near-field 

In  an  open  field  and  for  a  sound  source  far  away  from  the  listener’s  head,  both  ITDs  and  IIDs  are 
independent  of  the  distance  between  the  sound  source  and  the  listener.  However,  as  the  distance 
between  the  sound  source  and  the  listener  decreases,  the  difference  between  the  sound  intensities 
reaching  the  listener’s  left  and  right  ear  increases,  the  acoustic  shadow  behind  the  listener’s  head 
grows  larger,  and  the  curvature  of  the  sound  field  increases^  (Brungart  and  Rabinowitz,  1996). 
These  effects  cause  the  IID  to  gradually  increase  and  become  dependent  on  the  distance  between 
the  listener  and  the  sound  source. 

The region in which IIDs are independent of the distance between the sound source and the
listener is referred to in the localization literature as the far field, and the region in which they are
distance-dependent is called the near field of the head. The near field is generally assumed to
extend up to five times the radius of the head (or about 0.5-1.0 m) away from the center of the
listener’s head (Brungart, 1999; Duda and Martens, 1998), depending on the size of the head.

^The increase in the curvature of the sound field is due to the fact that at short distances from the sound source, the plane
wave approximation of the wave front is no longer valid.

The relation between distance and IID in the near field depends on both the azimuth angle
and the sound source frequency (spectrum). For example, for a sound source emitting a 500-Hz tone
and located at a 90° angle to the listener, the far-field IID at 1 m distance is about 3 dB and the
near-field IID at 20 cm is as large as 13 dB (Brungart and Rabinowitz, 1996). Therefore, as a
result of the increased IID in the near field, the perceived location of the sound source is
shifted laterally.
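
Because the IID is expressed in decibels, the size of the near-field effect is easier to appreciate as an intensity ratio. The short Python sketch below assumes the usual definition of a level difference, 10 log10 of the intensity ratio; the function name is hypothetical, and the 3 dB and 13 dB values are those quoted above from Brungart and Rabinowitz (1996).

```python
def db_to_intensity_ratio(level_difference_db):
    """Convert a level difference in dB to an intensity (power) ratio."""
    return 10.0 ** (level_difference_db / 10.0)


far_field_ratio = db_to_intensity_ratio(3.0)    # IID of about 3 dB at 1 m
near_field_ratio = db_to_intensity_ratio(13.0)  # IID of about 13 dB at 20 cm

print(f"far field:  near-ear intensity is about {far_field_ratio:.1f} times the far-ear intensity")
print(f"near field: near-ear intensity is about {near_field_ratio:.1f} times the far-ear intensity")
```

In other words, the 10 dB increase in IID between the two distances corresponds to a tenfold increase in the interaural intensity ratio.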

Similarly, the changes in the IID with the changes in the distance between the listener and the
sound source affect the listener’s judgment of the actual distance to the sound source, making it
more accurate than in the far field, especially for sound sources located at the lateral
directions (Brungart, 1998). In contrast to the IID changes, the ITD remains relatively
independent of distance in the near field and its small changes do not affect distance perception
(Brungart, 1998; Duda and Martens, 1998). Brungart (1998) measured the compound
localization error (see section 5) in the 3-D space in proximity of the listener’s head and reported
an average error of 16.5°. This error is similar in size to the average far field compound
localization error (21.1°) reported by Wightman and Kistler (1989a), indicating similar
localization accuracy in both far field and near field. However, the number of reversal errors^
reported by Brungart (1998) was noticeably larger (16.4%) than reported in far field studies
(2%-11%).

2.1.6  Limitations  of  Binaural  Cues 

Many experimental studies have confirmed that binaural cues are the main localization
mechanisms in the horizontal plane. The ITD provides left-right localization cues at low
frequencies, below ~800 Hz, and the IID provides left-right localization cues at high frequencies,
above ~1600 Hz. In the 800-1600 Hz range neither individual binaural cue is particularly
effective, but working in tandem they provide somewhat more effective localization than
either cue alone.

If  one  assumes  that  both  the  ITD  and  IID  cues  are  equally  effective  across  their  optimum 
frequency  ranges,  then  the  low-  and  high-frequency  parts  of  a  given  sound  spectrum  should  be 
equally  localizable.  A  frequency  that  divides  the  sound  spectrum  into  two  parts  that  are  “equal” 
with  respect  to  some  specific  criterion  (such  as  localizability)  is  sometimes  referred  to  as  the 
center  of  gravity  of  the  sound  spectrum.  In  the  case  of  localizability,  the  crossover  frequency  for 
ITD  and  IID  cues,  say  1200  Hz,  does  not  exhibit  this  center  of  gravity  property,  that  is,  the  part 
of  a  sound  below  1200  Hz  is  not  localized  just  as  well  as  the  part  above  1200  Hz.  King  and 
Oldfield  (1997)  reported  that  for  the  three  subjects  they  tested,  the  center  of  gravity  was  in  the 
8-9 kHz range. This supports the general observation that high frequency sounds are localized
more effectively than low frequency sounds and that the localization effectiveness of IID cues is
superior to that of ITD cues.

^Reversal errors are discussed in section 10.

Despite their great role in horizontal localization, the binaural cues are only marginally useful for
vertical  localization  or  front-back  differentiation.  This  is  due  to  the  spatial  ambiguity  caused  by 
left-right  head  symmetry  commonly  referred  to  as  the  cone  of  confusion  (Wallach,  1939).  The 
cone  of  confusion  is  the  imaginary  cone  extending  outward  from  each  ear  along  the  interaural 
axis  and  representing  sound  source  locations  producing  the  same  interaural  differences.  The 
concept  of  cone  of  confusion  is  shown  in  figure  3. 


In  general,  sound  source  locations  on  the  surface  of  the  cone  of  confusion  cannot  be  identified 
using  binaural  cues,  although  asymmetry  in  ear  placement  on  the  head  and  in  the  shape  of  the 
pinnae  provides  some  disambiguation.  Nonetheless,  in  order  to  reliably  differentiate  between 
specific  positions  on  the  surface  of  the  cone  of  confusion,  other  cues  are  needed.  These  cues  are 
called  monaural  cues  as  they  do  not  depend  on  the  presence  of  two  ears. 
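
The ambiguity can be illustrated with the simplified ITD model of equation 2. The Python sketch below is an illustration under stated assumptions rather than a model from the cited literature: it treats the head as a symmetric sphere, measures azimuth from straight ahead (0° front, 90° right, 180° back), and extends equation 2 to elevated sources by replacing sin θ with the component of the source direction along the interaural axis. All function names are hypothetical.

```python
import math


def interaural_axis_component(az_deg, el_deg):
    """Component of the unit source direction along the interaural axis.

    Assumed convention: azimuth 0 = front, 90 = right, 180 = back;
    elevation 0 = horizontal plane, 90 = directly overhead.
    """
    return math.sin(math.radians(az_deg)) * math.cos(math.radians(el_deg))


def itd_simplified(az_deg, el_deg, r=0.09, c=340.0):
    """Equation 2 with sin(theta) replaced by the interaural-axis component."""
    return 2.0 * r / c * interaural_axis_component(az_deg, el_deg)


front = itd_simplified(30.0, 0.0)      # 30 deg to the right, in front
back = itd_simplified(150.0, 0.0)      # mirror position behind the listener
elevated = itd_simplified(45.0, 45.0)  # elevated position on the same cone

for label, itd in (("front", front), ("back", back), ("elevated", elevated)):
    print(f"{label:9s} ITD = {itd * 1e6:.1f} us")
```

All three positions give the same ITD (about 265 μs with the assumed r and c), so in this idealized model the ITD alone cannot distinguish them.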

2.2  Monaural  Cues 

Monaural  cues  result  from  sound  energy  absorption  by  the  head,  shadowing  and  baffle  effects  of 
the  outer  ear  (pinna)^,  and  sound  reflections  caused  by  the  outer  ear,  head,  and  shoulders 
(Batteau,  1967;  Bloch,  1893;  Gardner  and  Gardner,  1973;  Lopez-Poveda  and  Meddis,  1996; 
Mach,  1906/1959;  Musicant  and  Butler,  1984;  Steinhauser,  1879).  Even  the  presence  (or  lack) 
of hair and hair arrangement may affect monaural cues (Treeby et al., 2007). All these physical
effects  result  in  spectral  changes  in  the  sounds  arriving  at  the  ears  and  are,  therefore,  often 
referred  to  as  monaural  spectral  cues.  Acoustic  shadowing  occurs  when  a  sound  wave  is 
reflected  by  an  encountered  object,  causing  an  acoustic  shadow  behind  the  object.  In  the  case  of 
the human head, this is particularly noticeable for frequencies above 1 kHz (e.g., Mills, 1972).
The baffle effect is an increase in sound pressure in front of an object due to the reflected energy.
Both of these effects and the specific reflections from the different parts of the pinna, head, and
torso produce peaks and troughs in the sound spectrum that are unique for each sound source
location in space^10 relative to the position of the listener (Bloom, 1977; Butler and Belendiuk,
1977; Watkins, 1978). Reflections from the torso (shoulders) affect sounds in the frequency
range of 2-3 kHz (e.g., Algazi et al., 2001; Gardner, 1973), while pinna effects are most
pronounced above 3-4 kHz (e.g., Roffler and Butler, 1968b). This means that monaural cues
generated by pinnae and body reflections are high frequency cues needed for accurate sound source
localization (e.g., Butler, 1975). The absence of pinna cues (e.g., by filling the concavities of the
pinna) greatly decreases localization accuracy (Gardner and Gardner, 1973; Oldfield and Parker,
1984b; 1986; Roffler and Butler, 1968b) and destroys the “outside-of-the-head” spatial
impression (Plenge, 1974). Physical differences between the left and right pinnae and the overall
left-right asymmetry of the human body also generate interaural spectrum differences (ISDs),
which additionally differentiate the sounds entering each ear. Further, the lateral part of the
human ear canal is slanted about 15° upwards while the medial part of the canal is slanted
downwards, potentially providing an additional mechanism for differentiating sounds coming from
above and from below (Shaw, 1996). These differences create additional spectral cues that are
related to the monaural cues and aid localization in the horizontal plane (Searle et al., 1975;
Shaw, 1974; 1982).

^Bloch (1893) seems to be the first one to demonstrate that changes in the shape of the pinna result in changes in the
perceived locations of sound sources.

A  number  of  studies  demonstrated  that  people  listening  with  just  one  ear  can  localize  sound 
sources  in  the  horizontal  plane  although  such  localization  accuracy  is  much  poorer  than  with  two 
ears  and  all  localization  judgments  are  shifted  toward  the  active  ear  (Belendiuk  and  Butler,  1975; 
Butler, 1987; Butler and Flannery, 1980; Butler and Naunton, 1967; Jin et al., 2004; Morimoto,
2001;  Oldfield  and  Parker,  1984a;  1986;  Van  Wanrooij  and  Van  Opstal,  2004).  Such  localization 
ability  is  proof  that  horizontal  localization  can,  to  some  degree,  be  facilitated  by  monaural  cues. 
In  this  case,  the  emitted  sound  must  contain  energy  above  ~5  kHz,  that  is,  in  the  frequency  range 
where  the  pinna  cues  have  an  appreciable  role.  Batteau  (1967)  and  Fisher  and  Freedman  (1968) 
seem  to  attribute  the  monaural  localization  ability  to  a  sequence  of  time-delayed  reflections  from 
the  pinnae  surfaces.  However,  it  is  unclear  to  what  extent  this  mechanism  is  helpful  when 
binaural  cues  are  present.  Macpherson  and  Middlebrooks  (2000,  p.  2233)  asserted  that  in  this 
situation  “monaural  spectral  cues  had  little  or  no  influence  on  perceived  lateral  angle.” 

Monaural spectral cues and the related interaural spectral cues help the binaural cues resolve
sound source laterality, but they are most critical for vertical localization and front-back
differentiation (e.g., Blauert, 1974/2001; Gardner and Gardner, 1973; Oldfield and Parker,
1984b). The relative importance of the interaural spectral cues to the localization of sound
sources at different elevations is hard to generalize since it varies with the lateral position of the
sound source (e.g., Jin et al., 2004). Oldfield and Parker (1986) demonstrated that monaural
localization in the vertical plane, which does not take advantage of the interaural spectral cues, is
relatively good but somewhat less accurate than binaural localization. Similar data were reported
by Humanski and Butler (1988) and Slattery and Middlebrooks (1994). However, the results of
such monaural studies are hard to interpret since “monaural listening actually provides
conflicting and unnatural cues to sound source position [and] one cannot be certain that the
listener’s judgments of apparent sound source position will reflect only the influence of spectral
cues” (Wightman and Kistler, 1997, p. 1061).

^10 The differences in monaural cues are much greater in the vertical plane than in the horizontal plane, where they are much weaker
than the corresponding differences in the binaural cues.

The spectral cues that are the most important for accurate front-back and up-down differentiation
are located in the 4-16 and 6-12 kHz frequency ranges, respectively (e.g., Langendijk and
Bronkhorst, 2002a). Hebrank and colleagues (Hebrank and Wright, 1974; Wright et al., 1974)
identified the major monaural cues in the median plane as a notch (N1) between 4 and 8 kHz (front
cue), a peak (P1) between 7 and 9 kHz (overhead cue), and a peak (P2) between 10 and 12 kHz
(back cue) in the sound spectrum. Langendijk and Bronkhorst (2002b) confirmed that these two
peaks and the notch are sufficient to obtain realistic virtual sounds in a 3-D space. The overall
envelopes of the sound spectra recorded in the ear canal are relatively similar across people, but
the major peaks and notches have a tendency to be shifted down in frequency for people of larger size
(Middlebrooks et al., 1989). Asano et al. (1990), Butler and Humanski (1992), and Algazi et al.
(2001) — but  not  Morimoto  et  al.  (2003) — reported  that  in  addition  to  high  frequency  monaural 
cues,  the  low  frequency  (<2  kHz)  cues  may  also  be  important  for  front-back  differentiation  and 
vertical  localization,  especially  for  elevations  exceeding  45°,  where  the  monaural  high  frequency 
cues  become  less  effective.  This  effect  may  be  due  to  the  asymmetrical  locations  of  pinnae  on 
the  head  surface  and  to  elevation-dependent  low-frequency  sound  modifications  caused  by  head 
diffraction  and  torso  reflections  (e.g.,  Gardner,  1973;  Genuit  and  Platte,  1981;  Kuhn,  1987). 
These  modifications  are  small  for  sound  sources  located  in  the  median  plane,  but  they  gradually 
become  more  pronounced  at  larger  azimuth  angles,  that  is,  at  angles  away  from  the  median  plane 
(e.g.,  Algazi  et  al.,  2001).  This  dependence  may  explain  the  poor  localization  of  low  frequency 
sound sources located in the median plane reported by Morimoto et al. (2003).

Both  the  binaural  and  monaural  cues  are  unique  properties  of  each  individual  person  due  to  the 
unique  anatomic  features  of  each  person’s  head.  These  anatomical  differences  are  reflected  in 
the  pattern  of  the  head  related  transfer  functions  (HRTFs)  of  each  person’s  head.  An  HRTF  is  a 
frequency-dependent  transfer  function  between  sound  source  location  in  space  and  the  point  at 
the  entrance  to  the  listener’s  ear  canal.  A  pair  of  such  functions,  for  the  left  and  right  ear, 
uniquely  represents  the  location  of  a  sound  source  in  the  space  as  heard  by  a  given  listener 
(Watanabe et al., 2007). These functions are, in general, not transferable between individuals and
are  most  different  for  frequencies  in  the  high  frequency  region  of  5-10  kHz,  where  the  pinna 
contributions  are  the  largest.  The  maxima  and  notches  in  the  HRTF  pattern  can  be  as  large  as 
25 dB (e.g., Mills, 1972; Wightman and Kistler, 1989a), and their size and distribution depend
on the monaural cues and the slight natural asymmetry in ear placement on the head (e.g., King,
1999; Knudsen, 1984).
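
To make the idea of a left-right pair of transfer functions concrete, the following Python sketch filters a short signal with two toy impulse responses (the time-domain counterparts of HRTFs) and then recovers the interaural time and level differences from the two ear signals. The impulse responses are made up for illustration (a pure delay and attenuation at the far ear); measured HRTFs have the frequency-dependent peak-and-notch structure described above. The sample rate, test signal, and function names are all assumptions.

```python
import math


def convolve(signal, impulse_response):
    """Direct-form discrete convolution of two sequences."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def level_difference_db(a, b):
    """Level of a relative to b: 10*log10 of the energy ratio."""
    energy_a = sum(x * x for x in a)
    energy_b = sum(x * x for x in b)
    return 10.0 * math.log10(energy_a / energy_b)


def lag_of_max_correlation(a, b, max_lag):
    """Lag in samples (positive: b trails a) at the cross-correlation peak."""
    best_lag, best_val = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        val = sum(a[n] * b[n + lag] for n in range(len(a)) if 0 <= n + lag < len(b))
        if val > best_val:
            best_lag, best_val = lag, val
    return best_lag


SAMPLE_RATE = 48000  # Hz, assumed

# Toy impulse responses for a source on the listener's left (not measured data):
# the right ear receives the sound 30 samples later and at half the amplitude.
hrir_left = [1.0]
hrir_right = [0.0] * 30 + [0.5]

source = [1.0, 0.5, 0.25]  # short decaying click
left_ear = convolve(source, hrir_left)
right_ear = convolve(source, hrir_right)

lag = lag_of_max_correlation(left_ear, right_ear, max_lag=40)
itd_us = lag / SAMPLE_RATE * 1e6
iid_db = level_difference_db(left_ear, right_ear)
print(f"ITD = {itd_us:.0f} us, IID = {iid_db:.1f} dB")
```

With these toy responses the recovered ITD is 625 μs (30 samples at 48 kHz) and the IID is about 6 dB, which is simply the delay and attenuation built into the right-ear response.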
The differences in individual HRTFs create the cues that each person learns during their lifetime.
This  is  the  reason  why  people  who  do  not  seem  to  differ  in  localizing  real-world  sounds  may 
differ  quite  dramatically  when  exposed  to  the  same  AVR  environment  preprocessed  through 
somebody  else’s  HRTF.  Wenzel  et  al.  (1993)  demonstrated  that  the  rate  of  localization  error  for 
an  AVR  in  which  individually  measured  HRTFs  (individualized  HRTFs)  are  used  is  much  lower 
than  for  an  AVR  based  on  a  non-individualized  HRTF  (i.e.,  an  average  HRTF  or  HRTF  from  a 
representative  listener).  However,  it  should  be  noted  that  not  all  the  details  of  an  individual 
HRTF  need  to  be  captured  exactly  in  order  to  preserve  the  natural  locations  of  sound  sources. 
Kulkarni and Colburn (1998) studied the effect of spectral smoothing of HRTFs and
demonstrated  that  “crude  approximations  of  the  natural  ear-input  signals  were  perceived  as 
natural  provided  that  these  waveform  were  made  to  change  in  a  manner  consistent  with  the 
movement  of  the  listener  head”  (p.  748).  Pulkki  (2001)  hypothesized  that  good  localization  in 
virtual  space  is  dependent  on  the  preservation  of  the  pattern  of  pinna-mode  frequencies  rather 
than  the  specific  details  of  peaks  and  notches.  This  means  that  if  the  specific  frequencies  of 
spectral  peaks  and  notches  are  preserved,  the  relative  sizes  of  the  peaks  and  notches  are  not 
critical. 

2.3  Dynamic  Cues 
2.3.1  Head  Movements 

In  addition  to  binaural  and  monaural  cues,  spatial  localization  ability  in  both  the  horizontal  and 
vertical  planes  is  also  dependent  on  head  movements,  which  cause  momentary  changes  in  the 
peak-and-trough  pattern  of  the  sound  spectrum  at  each  ear  (e.g.,  Fisher  and  Freedman,  1968; 
Iwaya  et  al.,  2003;  Jongkees  and  Veer,  1958;  Lambert,  1974;  Ohtsubo  et  al.,  1980;  Perrett  and 
Noble,  1997a;  Thurlow  and  Runge,  1967;  Thurlow  et  al.,  1967;  Wallach,  1940;  Young,  1931). 
These  dynamic  cues  are  the  most  important  for  low  frequency  sounds  below  2  kHz  (Thurlow  and 
Mergener,  1970).  Most  authors  report  much  larger  localization  errors  when  the  listener’s  head  is 
fixed than when the listener is allowed to turn his head toward the source of sound (e.g., Link
and  Lehnhardt,  1966)  and  several  authors  consider  head  movements  as  the  most  essential 
mechanism  in  solving  front-back  uncertainty  (originally  proposed  as  such  by  Van  Soest,  1929, 
and  later  corroborated  by  Borger  et  al.,  1977;  DiCarlo  and  Brown,  1960;  Mackensen,  2003; 
Majdak  et  al.,  2010;  Nordlund,  1962ab;  Wallach,  1939;  and  Wightman  and  Kistler,  1999). 
Thurlow  et  al.  (1967)  studied  the  localization  performance  of  listeners  who  were  allowed  to 
move  their  heads  while  keeping  their  torso  straight.  They  observed  that  the  listeners  usually 
moved  their  head  back  and  forth  more  than  once  and  that  most  head  movements  were  small 
horizontal  rotations.  If  the  sound  is  long  enough  (600-800  ms),  such  movements  of  the  head 
allow  the  listener  to  disambiguate  front-back  confusions  and  focus  on  the  direction  of  the 
incoming  sound  (e.g.,  Iwaya  et  al.,  2003;  Lambert,  1974;  Noble,  1987;  Perrett  and  Noble,  1997a; 
Rakerd  and  Hartmann,  1986;  Thurlow  and  Runge,  1967).  For  the  same  reason,  a  train  of 
repeated  pulses  results  in  better  auditory  localization  of  the  sound  source  than  a  single  pulse 
(Macpherson and Middlebrooks, 2000). In general, the effects of pinna cues and head
movements seem to be additive for sound source localization in the horizontal plane. The
absence of one or the other results in a similar loss of localization acuity and a similar change in
error pattern (Muller and Bovet, 1999). However, head movements
may also result in localization errors. Such negative effects of head movements may be observed
if a short sound stimulus is heard during a rapid head movement (e.g., Cooper et al., 2008).

Wallach  (1939,  1940)  hypothesized  that  small  head  movements  in  the  horizontal  plane  should 
also  help  resolve  the  sound  source  position  in  the  vertical  plane.  Wallach  argued  that  the 
horizontal  rotation  of  the  head  should  eliminate  front-back  errors  by  changing  and  contrasting 
the  interaural  differences  (especially  ITDs)  caused  by  sound  sources  located  in  the  front  or  rear. 
For a given range of head rotations, these changes would be the greatest for horizontal locations,
nonexistent for vertical locations (±90°), and intermediate for partially elevated sound source
locations. Therefore, these rotations should also allow some degree of discrimination of the
sound source’s vertical displacement. The resulting cue, referred to as the Wallach cue, depends on
the presence of low frequency (below 2 kHz) energy in the signal and is most effective for sound
sources  located  in  the  upper  front  of  the  median  plane  (Perrett  and  Noble,  1997a).  The  Wallach 
cue  seems  to  serve  as  a  secondary  cue  for  vertical  localization,  and  if  the  monaural  pinna  cues 
are  sufficiently  strong,  its  presence  does  not  noticeably  improve  vertical  localization 
performance  (although  it  is  still  important  for  resolving  front-back  uncertainty). 
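
Wallach’s argument can be sketched numerically with the simplified model of equation 2. The Python example below is an illustration under stated assumptions, not Wallach’s own formulation: the head is a symmetric sphere, the head rotates about the vertical axis so that only the relative azimuth changes, and the ITD is taken as proportional to sin(azimuth) · cos(elevation). Positive values mean the sound reaches the right ear first; all names are hypothetical.

```python
import math


def itd_with_head_yaw(az_deg, el_deg, head_yaw_deg, r=0.09, c=340.0):
    """Simplified ITD (s) for a source at (az, el) after the head turns by head_yaw.

    Assumed convention: azimuth 0 = front, 90 = right, 180 = back;
    a positive yaw turns the head to the right.
    """
    relative_az = math.radians(az_deg - head_yaw_deg)
    return 2.0 * r / c * math.sin(relative_az) * math.cos(math.radians(el_deg))


def itd_change(az_deg, el_deg, head_yaw_deg):
    """Change in ITD caused by the head rotation alone."""
    return (itd_with_head_yaw(az_deg, el_deg, head_yaw_deg)
            - itd_with_head_yaw(az_deg, el_deg, 0.0))


YAW = 10.0  # degrees, a small rightward head turn
cases = {
    "front, horizontal": (0.0, 0.0),
    "back, horizontal": (180.0, 0.0),
    "front, 45 deg up": (0.0, 45.0),
    "directly overhead": (0.0, 90.0),
}
for label, (az, el) in cases.items():
    print(f"{label:18s} ITD change = {itd_change(az, el, YAW) * 1e6:+.1f} us")
```

In this sketch the same head turn produces ITD changes of opposite sign for front and rear sources, which is the front-back information, and the size of the change shrinks with elevation until it vanishes overhead, which is the elevation information.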

Other  head  or  body  movements  that  affect  localization  performance  are  tipping  the  chin  toward 
the  chest,  tilting  the  body,  or  pivoting  the  head  toward  one  or  the  other  shoulder.  While  such 
movements  may  help  to  determine  the  degree  of  elevation  of  the  sound  source  (Perrett  and 
Noble,  1997a),  they  progressively  displace  the  apparent  midline  in  the  direction  opposite  to  the 
direction  of  the  movement  and  affect  both  localization  performance  in  the  horizontal  plane  and 
localization  of  the  sound  source  located  just  above  the  listener’s  head  (e.g.,  Comalli  and 
Altshuler,  1971;  Teubert  and  Liebert,  1956).  Therefore,  it  is  very  important  that  listeners 
participating  in  localization  studies  are  always  reminded  to  keep  their  head  straight,  even  if  small 
rotational  movements  are  permitted. 

While modern studies mostly employ very short sounds, sounds as long as 3-4 s were used in
older studies, and the effects of head movements were easier to observe (e.g., Angell, 1903;
Thurlow and Mergener, 1970). It seems that a minimum duration of 600-800 ms is needed to
accommodate the effects of head movements. For example, Noble (1990) observed that head
movements had minimal effect on the localization of a 500-ms sound but caused a considerable
improvement in localization performance when the sound duration increased to 1.5 s. Similar
data were reported by Thurlow and Mergener (1970).

Regardless of the presence or lack of head movements, the sound event may need to be of a
certain duration to allow the listener to build a spatial image of the location of the sound source
(e.g., Blauert, 1974/2001; Burger, 1958; Kietz, 1953). For example, Pollack and Rose (1967)
observed that with no head movements, changing the signal duration from 3 ms to 1 s reduced
the average localization error from 10° to 2°. Tobias and Zerlin (1957; 1959) studied the effect
of stimulus duration on lateralization threshold, which is the smallest noticeable change in sound
source lateralization within the head, using noise bursts with durations from 10 ms to 1.9 s. They
concluded that for sound durations up to 700 ms, the threshold varied systematically with the
stimulus duration and became duration independent above 700 ms. These reports indicate that
sound duration affects localization performance beyond just allowing head movements. It is
noteworthy that the duration above which head movements meaningfully contribute to front-back
disambiguation also coincides with the perceptual boundary between short and long sounds
in the perception of music (450-900 ms [Clarke, 1999; Fraisse, 1978]). A longer duration also
permits the listener to recognize familiar sounds (see section 2.5). In one notable study, Noble
and Gates (1985) allowed the listeners to move their head and body (while remaining seated) and
control the duration of the presented stimuli. They reported far better localization accuracy than
was earlier reported by Roffler and Butler (1968b), who used similar signals but restricted the
listener’s movements.

2.3.2  Sound  Onset  and  Precedence  Effect 

Another  kind  of  dynamic  cue,  this  time  related  to  the  signal  as  opposed  to  the  listener,  is  the 
temporal  envelope  of  the  auditory  signal.  An  important  property  of  the  auditory  system  is  that  it 
primarily  reacts  to  the  onset  of  a  sound  event  (and  to  some  degree  its  offset)  while  suppressing 
the  effects  of  the  steady-state  part  of  the  sound  (Stecker  and  Hafter,  2002).  Both  the  sound 
identification  and  sound  source  localization  abilities  of  the  listener  depend  greatly  on  the  form 
and  duration  of  the  sound  onset,  especially  in  enclosed  spaces  (e.g.,  Elfner  and  Tomsic,  1968; 
Rakerd and Hartmann, 1986). According to Wilska (1938; table 5), tones in the frequency range
of 400-6400 Hz with on- and off-set durations <1 ms can be localized with less than 3° error
across  the  whole  frequency  range,  while  100  ms  on-  and  off-set  durations  lead  to  localization 
errors  ranging  from  5°  to  15°  with  increasing  tone  frequency. 

The importance of the front-end of the arriving waveform for sound source localization has been
termed the precedence effect (Wallach et al., 1949; Litovsky et al., 1999), Haas effect (Haas,
1951), or the law of the first wavefront (Cremer, 1948). Historical background of this effect
going back to works of Henry (1851; 1856), Fay (1936), and Hall (1936) can be found in
Gardner (1968). According to this law, the listeners make localization judgments based on the
earliest arriving sound, ignoring any other similar sounds arriving from other directions (e.g.,
reflections of the primary sound from the walls in a closed space). If the secondary sound is
delayed by 1 to 20 ms and has an intensity not exceeding the intensity of the primary sound by
more than 10 dB, only one sound is heard, and that sound is the primary sound^11. If the
secondary sound is delayed by less than 1 ms, it is perceptually integrated with the primary
sound, and the integrated sound is heard as arriving from a direction that is the average of both
directions (Hartmann, 1997; Shinn-Cunningham et al., 1993). If the secondary sound arrives
after a delay longer than 20 ms, it is heard as an echo. The precedence effect causes some
counterintuitive spatial effects such as the Franssen Effect^12 (Franssen, 1960) and the Clifton
Effect^13 (Clifton, 1987).

^11 The precedence effect is a binaural effect and exists only in a real sound field. Green (1976) demonstrated that while a
6-ms time delay between two identical pulses cannot be heard in a room, it can easily be heard with one ear over an earphone.

An  audible  sound  onset  and  the  existence  of  the  precedence  effect  are  the  main  reasons  that  we 
can  localize  sound  sources  even  in  reverberant  environments  with  multiple  reflective  surfaces  as 
long  as  we  hear  the  beginning  of  the  primary  sound.  This  is  also  why  acoustic  sources 
generating  impulse  sounds  (e.g.,  firearms)  are  easier  to  localize  than  sources  emitting  continuous 
or  slowly  rising  long  sounds.  This  effect  of  sound  envelope  supports  the  notion  that  short 
impulses  (5-2000  ms)  with  onset  time  <5  ms  are  the  easiest  sounds  to  localize  in  closed  spaces 
(e.g., Christian and Roser, 1957; Hartmann, 1983a; Laroche, 1994).
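
As an illustration only, the delay and level conditions quoted above for the precedence effect can be written as a simple decision rule. The Python sketch below is not taken from any of the cited studies: the function name, argument names, and category labels are hypothetical, and the boundaries (1 ms, 20 ms, and 10 dB) are simply the nominal values given in the text. Actual thresholds can be expected to vary with the stimulus and the listener.

```python
def classify_lead_lag(delay_ms, lag_minus_lead_db=0.0):
    """Label how a primary/secondary sound pair is heard (illustrative sketch).

    delay_ms          -- delay of the secondary sound re the primary sound, in ms
    lag_minus_lead_db -- level of the secondary sound minus level of the primary, in dB

    The labels are hypothetical names chosen for this sketch.
    """
    if delay_ms < 0:
        raise ValueError("delay_ms must be non-negative")
    if delay_ms < 1.0:
        # Sounds are integrated; the image is heard between the two directions.
        return "fused_average"
    if delay_ms <= 20.0:
        if lag_minus_lead_db <= 10.0:
            # Only one sound is heard, at the direction of the primary sound.
            return "fused_primary"
        # The text does not state what is heard when the secondary sound
        # exceeds the primary sound by more than 10 dB.
        return "not_covered"
    # Delays longer than 20 ms: the secondary sound is heard as an echo.
    return "echo"
```

For example, under these assumptions a reflection arriving 5 ms after the direct sound at the same level falls in the "fused_primary" category, while the same reflection delayed by 30 ms falls in the "echo" category.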

In  closed  spaces,  reflected  sounds  add  to  the  reverberant  character  of  the  perceived  sound  but  are 
not heard separately. As Hartmann (1997) pointed out, if the reverberation is not excessive,
we frequently do not realize its presence until we hear a recording of the sound in a given space
played in reverse. Localization acuity for a leading-lagging pair of sounds is almost as good as
for a single sound source, with the perceived location displaced slightly toward the direction of the lagging stimulus
(Zurek, 1980; Litovsky and Macmillan, 1994). However, it is important to stress that using the
precedence  effect  and  the  ability  to  localize  a  single  sound  source  are  different  phenomena. 

While  normal  infants  can  localize  a  single  sound  source  soon  after  birth,  they  must  learn  to  use 
the  precedence  effect,  which  generally  occurs  after  6  months  of  postnatal  cortical  development. 
Similarly,  unilateral  ablation  of  the  auditory  cortex  in  cats  disrupts  the  precedence  effect  but  does 
not  affect  the  localization  accuracy  of  a  single  sound  source.  See  Hartmann  (1997)  and  Zurek 
(1987)  for  more  information. 

2.4  Vision  and  Memory  Cues 

Other potential localization cues include visual cues (e.g., Lackner, 1973; Wallach, 1939),
vestibular cues (discussed in section 7) (e.g., Meurman and Meurman, 1954; Wallach, 1939),
prior knowledge of the stimulus (e.g., Angell and Fite, 1901ab; Kietz, 1953; Pierce, 1901;
Rogers and Butler, 1992), and the listener’s expectations. These cues are referred to in this report as
vision and memory cues.


^12 The Franssen Effect is an auditory localization illusion in which the listener incorrectly identifies the sound source emitting
the  sound.  It  can  be  demonstrated  by  placing  two  loudspeakers  (1  and  2)  in  a  room  at  a  certain  distance  apart.  At  the  beginning 
of  the  demonstration  a  pure  tone  abruptly  begins  to  be  emitted  from  loudspeaker  1.  After  some  time  the  signal  is  gradually  faded 
over  from  loudspeaker  1  to  loudspeaker  2  keeping  the  total  signal  power  constant.  At  the  end  of  the  fading  phase,  the  pure  tone  is 
only  emitted  from  loudspeaker  2,  yet  the  listener  still  localizes  loudspeaker  1  as  its  source.  A  good  discussion  of  the  Franssen 
Effect  can  be  found  elsewhere  (Hartmann  and  Rakerd,  1989b). 

^13 The Clifton Effect can be demonstrated by emitting a series of clicks from two loudspeakers, one loudspeaker emitting the
primary (strong) clicks and the other emitting the secondary (weak) clicks with a 10-ms delay (three click pairs per second). An
abrupt reversal of the directions from which the two clicks come renders both sound sources temporarily audible, but after a
few more repetitions the source of the lagging (weaker) clicks “disappears” again.

Many observations indicate that visual perception dominates auditory perception with respect to
localization  and  that  people  have  a  tendency  to  trust  their  eyes  more  than  their  ears  (Ghirardelli 
and Scharine, 2009; p. 605). This, to some extent, may be because auditory
localization is less acute than visual localization. This difference is more
pronounced in the vertical plane, where listeners consistently tend to underestimate the elevation of
the sound source, than in the horizontal plane (Dobreva et al., 2005). Heffner and Heffner (1992)
hypothesized that the relatively poor acuity of the auditory localization system in comparison to
the visual localization system may be due to the fact that its main role is to direct vision toward the
sound source rather than to be a discriminative system on its own.
This  seems  to  be  supported  by  the  fact  that  the  acuity  of  auditory  localization  among  various 
species  is  inversely  proportional  to  the  width  of  the  field  of  best  vision  (Heffner,  2004).  See  also 
section  4. 

When a person sees a sound source, their measured auditory localization acuity is artificially increased because they
point toward the visual object (Shelton and Searle, 1980; Stein et al., 1989; Godfrey and
Roumes,  2004).  Even  more  importantly,  if  a  person  sees  an  object  that  could  be  the  source  of  an 
arriving  sound,  they  may  frequently  select  this  object  as  the  source  regardless  of  whether  this 
object  actually  produced  the  sound  or  not  (Jackson,  1953;  Warren,  1970).  In  general,  if  vision 
and  hearing  report  conflicting  information,  vision  almost  always  dominates  hearing.  This 
phenomenon has been termed the capture effect (e.g., Ghirardelli and Scharine, 2009). The most
widely known form of the capture effect is the ventriloquism effect (VE) (Howard and
Templeton, 1966), in which the listener perceives the ventriloquist’s speech as coming from
the ventriloquist’s dummy. The visual capture effect is very strong when the angular difference in
position between the visual object and the sound source is less than 30°, although Thurlow and
Jack (1973a) reported that some listeners were confused at angles as large as 60°. The closer the
visual target is to the midline, the more likely the capture effect (Hairston et al., 2003).

In  contrast  to  the  effects  caused  by  the  visible  sound  sources,  it  is  not  entirely  clear  whether 
simply  the  presence  of  a  visual  environment  influences  the  accuracy  of  localization  of  invisible 
(or  indiscernible)  sound  sources  in  space.  Shelton  et  al.  (1982)  reported  that  listeners  who  could 
move  their  head  and  see  their  surroundings  made  fewer  localization  errors  than  listeners  who  had 
their  eyes  covered  with  opaque  goggles,  even  when  no  visual  information  was  associated  with 
the  sound  sources.  They  further  hypothesized  that  head  movements  improve  localization  acuity 
only in the presence of visual cues. In contrast, Bauer and Blackmor (1965) observed that
auditory localization acuity was the same in daylight and in darkness. Lovelace and Anderson
(1993) compared listeners’ localization acuity for a non-visible sound source with eyes open and
closed during sound presentation and also found no difference. They also argued (p. 843) that if
any  effect  of  vision  on  auditory  localization  acuity  should  be  expected  it  should  be  a  negative 
rather  than  positive  effect  “since  visual  influence  can  introduce  interference  that  would  increase 
the  magnitude  of  error  in  sound  localization  (as  occurs  in  visual  capture)  one  might  even 
hypothesize that closing one’s eyes might result in improved accuracy of sound localization.”
This view is supported by extensive anecdotal evidence and by the conclusion reached by King
(2009, p. 331) in a review of visual influences on auditory spatial learning that “accurate, and
even supra-normal, auditory localization abilities can be achieved in the absence of vision.” This
also agrees with the observations that blind listeners are at least comparable to, and usually slightly
better than, sighted listeners in performing localization tasks (Ashmead et al., 1998; Lessard et al.,
1998; Starlinger and Niemeyer, 1981; Simon et al., 2002). The view that general
visibility of the surroundings is usually not helpful in auditory localization tasks is further supported
by Lewald (2007), who reported that short-term (90 min) light deprivation prior to the
localization task improves localization accuracy (but not localization precision; see section 5).

Additional  general  factors  that  may  affect  sound  source  localization  are  the  listener’s  familiarity 
with  the  sound  source  and  the  listener’s  expectations.  Various  authors  have  reported  that  the 
localization  of  unfamiliar  sounds  is  worse  than  that  of  familiar  ones  (Blauert,  1974/2001;  Brown 
and May, 2005; Coleman, 1962; Kietz, 1953; Plenge and Brunschen, 1971; Plenge, 1972). This
is because a listener can exploit the dependence of the spectrum of
an arriving sound on its angle of incidence only if the sound is known to the listener.
For  example,  familiarity  with  the  sound  source  (e.g.,  a  voice  of  a  particular  person)  may  help  to 
disambiguate  potential  front-back  confusion  and  determine  whether  the  sound  is  coming  from 
the  front  or  from  the  rear.  Blauert  (1974/2001)  cited  two  studies  (Blauert,  1970;  Wettschurek, 
1971)  in  which  listeners  localized  familiar  and  unfamiliar  voices  in  the  median  plane  and 
reported  localization  errors  of  9°  and  17°,  respectively.  Similarly,  sound  coloration  may  indicate 
whether  the  sound  source  is  behind  another  object  or  in  a  direct  path  to  the  listener.  Once  the 
perceived  position  of  the  sound  source  is  stored  in  the  listener’s  memory,  it  aids  in  localization 
(Han, 1992). The role of familiarity in localization performance is also underscored by reports that
some hearing aid users localize worse with hearing aids than without (e.g., Noble and Byrne,
1990). Yet, despite the plethora of localization cues, some listeners’ expectations are so strong
that  they  can  override  all  the  auditory  cues.  Even  if  an  eagle’s  cry  is  played  from  a  loudspeaker 
located  on  the  ground  (outdoors),  most  people  will  still  first  look  to  the  sky. 

2.5  Directional  Bands 

Since  sound  source  localization  in  the  vertical  plane  depends  greatly  on  modifications  to  the 
sound  spectrum  by  torso  and  pinna  reflections,  perceived  changes  in  source  elevation  can  be  also 
produced by deliberate changes in the sound spectrum without moving the physical source (Xu et
al., 2000). Blauert (1968; 1969), Middlebrooks and Green (1991), Rogers and Butler (1992),
Middlebrooks  (1992),  and  others  have  demonstrated  that  for  continuous  tones  and  narrow  noise 
bands  the  perceived  location  of  the  sound  source  in  the  median  plane  is  not  related  to  the  actual 
position  of  the  sound  source  but  to  the  dominating  frequency  of  the  sound  when  the  head  is  kept 
in a fixed position. For example, the sound spectra at the ears for sounds arriving from the
frontal, overhead, and rear directions have peaks at around 250-500 Hz and 2-5 kHz (frontal), 6-8 kHz (overhead),
and 0.8-1.6 kHz and 10-12 kHz (rear) (Blauert, 1968; Han, 1991; 1992; Hebrank and Wright,
1974; Itoh et al., 2007; Morimoto and Aokata, 1984; Wright et al., 1974). A schematic view of
the distribution of directional bands along the frequency scale is shown in figure 4. Thus, tones
and narrowband noises in the respective bands have a tendency to be localized as arriving from
the frontal, overhead, and rear directions regardless of the actual position of the sound source.
However, large individual differences are to be expected (Itoh et al., 2007), and the effect mostly
disappears for dynamically changing stimuli. Blauert (1969; 1974/2001) used the term
directional  bands  to  describe  this  phenomenon  and  the  directions  assigned  to  the  specific 
frequency  bands. 
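
The band edges quoted above can be collected in a small lookup table. The following Python sketch is only an illustration: the table name, function name, and direction labels are hypothetical, the band edges are the approximate values given in the text, and, as noted above, large individual differences are to be expected.

```python
# Approximate directional bands in the median plane, as quoted in the text.
# Each entry is (lower edge in Hz, upper edge in Hz, perceived direction).
DIRECTIONAL_BANDS = [
    (250.0, 500.0, "front"),
    (800.0, 1600.0, "rear"),
    (2000.0, 5000.0, "front"),
    (6000.0, 8000.0, "overhead"),
    (10000.0, 12000.0, "rear"),
]


def likely_direction(center_hz):
    """Return the direction a tone or narrowband noise centered at
    `center_hz` tends to be heard from, or None if the frequency falls
    outside the bands listed in the text."""
    for low_hz, high_hz, direction in DIRECTIONAL_BANDS:
        if low_hz <= center_hz <= high_hz:
            return direction
    return None
```

Under these assumptions, a narrowband noise centered at 7 kHz would tend to be heard overhead and one centered at 1 kHz behind the listener, regardless of the actual position of the source in the median plane.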


Figure  4.  Directional  bands  in  the  median  plane.  The  angles  0°,  90°,  and  180° 
indicate  front,  up,  and  back  directions,  respectively.  Adapted  from 
Blauert  (1974/2001). 

It  can  be  hypothesized  that  reports  indicating  that  harmonic  structure  is  more  important  for 
grouping  acoustic  stimuli  in  space  than  their  actual  spatial  proximity  (e.g.,  Buell  and  Hafter, 
1991)  may  be  related  to  the  phenomenon  of  directional  bands. 

2.6  Effects  of  Hearing  Loss,  Age,  and  Gender 

2.6.1  Hearing  Loss 

Generally,  asymmetrical  (unilateral)  hearing  loss  decreases  localization  performance  in  the 
horizontal plane (e.g., Comalli and Altshuler, 1976; Hattori, 1966; Hausler et al., 1983; Link and
Lehnhardt, 1966; Matzker and Springborn, 1958; Newton and Hickson, 1981; Viehweg and
Campbell, 1960). This decrease is always present when peripheral asymmetry is artificially
introduced by an earplug and is usually, but not always (see appendix B on localization training),
present when the asymmetry is caused by differences between ear sensitivities (Nabelek et al.,
1980). In both cases, however, the decrease in performance seems to be worse if the left ear is
the “better ear” (Bess et al., 1986; Gustafson and Hamill, 1995). In contrast, symmetrical
hearing loss of as much as 30-40 dB has been reported by several authors to have little effect on
localization  performance  in  the  horizontal  plane  (e.g.,  Abel  and  Hay,  1996;  Blauert,  1974/2001; 
Butler,  1970;  Rosner,  1965;  Tonning,  1973b).  There  is  a  consensus  among  researchers  on  the 
very  slight  effect  of  symmetrical  sensorineural  (high  frequency)  hearing  loss  on  localization 
performance.  However,  a  significant  decrease  in  localization  performance  was  reported  by  some 
authors in the case of conductive (low frequency) hearing loss (e.g., Gatehouse and Pattee, 1983;
Noble et al., 1994; 1997). Noble and his colleagues attributed this finding to the disruption of
ITD cues and the increased role of bone conduction in sound transmission to the inner ear (Noble
et al., 1994; 1997).

Asymmetrical  hearing  loss  results  in  large  localization  errors  in  the  horizontal  plane,  but  even 
total  deafness  in  one  ear  allows  some  degree  of  horizontal  sound  source  localization  (Bochenek 
and  Mitkiewicz-Bochenek,  1963;  Tonning,  1973b).  It  is  important  to  note  that  with  time  and 
experience,  the  size  of  localization  errors  made  by  people  with  asymmetrical  hearing  loss  is 
gradually reduced (Angell and Fite, 1901b; Perrott and Elfner, 1968; Hausler et al., 1983). This
may  be  due  to  progressively  greater  use  of  head  movements  in  directional  recognition  and 
greater  experience  in  using  new  localization  cues. 

As with the effect of bilateral hearing loss on localization in the horizontal plane,
localization in the vertical plane seems to depend on the type of hearing loss. Listeners with
bilateral sensorineural (high frequency) hearing loss are reported to perform worse than listeners
with conductive hearing loss (Butler, 1970; Noble et al., 1994). However, contrary to
localization  in  the  horizontal  plane,  monaural  localization  in  the  vertical  plane  is  barely  affected 
by  hearing  loss  (Angell  and  Fite,  1901a;  Butler,  1970). 

2.6.2  Age 

In  contrast  to  the  very  limited  effect  of  the  observer’s  age  on  visual  spatial  perception,  several 
authors  have  reported  a  noticeable  effect  of  age  on  auditory  localization  (e.g.,  Abel  and  Hay, 
1996;  Dobreva,  2010;  Hattori,  1966;  Link  and  Lehnhardt,  1966;  Matzker  and  Springborn,  1958; 
Tonning, 1973b; Viehweg and Campbell, 1960). In an extensive study, Abel et al. (2000)
investigated  the  effect  of  aging  on  localization  in  the  horizontal  plane  for  7  groups  of  16 
listeners,  aged  10-81,  and  reported  a  decrease  in  performance  as  early  as  in  the  third  decade. 
Using the categorical localization paradigm (see section 11) and many different arrangements of
loudspeakers,  they  observed  decrements  in  localization  performance  on  the  order  of  12%-15% 
across  all  age  groups.  The  decrease  was  largest  for  low  frequency  noise  (i.e.,  ITD  differences) 
and the smallest for broadband noise (i.e., IID+ITD differences). Similar findings were reported
by Babkoff et al. (2002), who found that the accuracy of ITD-based sound source
lateralization declines substantially with age, while IID-based lateralization does not, and that the
age-related worsening in temporal resolution may affect the performance of auditory
localization. These data support the idea presented by Scharf et al. (1976) that the human abilities
to analyze the frequency content of an incoming signal and to localize it on the basis of ITDs are
closely related. In contrast, Savel (2009) found no effect of age on localization acuity (both
accuracy and precision; see section 5) or on intrasubject variability for localization of
low-frequency noises in the horizontal plane.

Several  authors  have  argued  that  the  age-related  decline  in  localization  performance  may  be  due, 
at  least  partially,  to  the  confounding  effect  of  age-related  hearing  loss  (e.g.,  Nordlund,  1964; 
Terhune, 1974). While this argument cannot be dismissed solely on the basis of the studies
discussed in the previous paragraph (several authors used age-corrected norms of normal
hearing), several other studies have demonstrated that symmetrical hearing loss in young people
does not appreciably affect localization performance as long as the arriving sounds are clearly
audible (see section 2.6.1 above), leaving an age effect as the main source of declining
localization performance.

2.6.3  Gender 

Nilsson  et  al.  (1973)  and  Newton  and  Hickson  (1981)  found  no  difference  in  the  auditory 
localization  ability  of  female  and  male  listeners.  Langford  (1994)  and  Saberi  and  Antonio  (2003; 
2004)  observed  some,  but  small,  gender-related  differences  in  the  discrimination  of  ITD  and  IID 
cues,  with  female  listeners  being  somewhat  less  sensitive  and  more  variable  in  their  performance 
than  male  listeners.  Larger  gender-related  functional  asymmetries  in  auditory  spatial  perception 
have been reported by Lewald (2004). In a simple pointing task testing monaural sound
localization in the vertical plane, female listeners were more precise when listening with the left
ear, whereas male listeners did better with the right ear. This was attributed to sexual
dimorphism of the posterior parietal cortex or the planum temporale, both areas known to be
involved in spatial auditory functions. These results agree with Savel’s (2009) observation that
male listeners frequently have asymmetrical spatial acuity, favoring the left hemisphere. Greater
asymmetry in the planum temporale in males than in females has been implicated as one of the
potential causes of the asymmetric perception (e.g., Voyer, 1996). More recently, Zundorf et al.
(2011) reported that while localization acuity in quiet is not gender-dependent, female listeners
have greater difficulty in localizing sound sources in noisy environments (cocktail party effect)
and are more prone to reversal errors (see section 10).


3.  Physiology  of  Auditory  Localization 


The  acoustic  coding  of  spatial  information  is  the  result  of  the  physical  spacing  of  the  ears  and  the 
filtering  properties  of  the  human  body,  including  the  torso,  head,  and  pinnae.  The  spatial  cues 
embedded  in  the  auditory  signal  are  additionally  amplified  or  attenuated  in  the  process  of 
impedance  transformation  while  the  auditory  stimulus  travels  from  the  outer  ear  to  the  cochlea. 
The  complex  auditory  signal  reaching  the  cochlea  is  sampled  and  frequency  analyzed  and  finally 
converted into neural responses that are transmitted to the central nervous system (CNS) by the
bundle of neurons forming the auditory nerve. The neural responses from the left and right ear
converge at the binaural neural centers of the CNS that merge the left and right input signals into
a binaural neural code. A schematic drawing of the auditory pathways of the central nervous
system is shown in figure 5. According to Boehnke and Phillips (1999), human localization
ability in the horizontal plane is based on the input information received from two broadly tuned
spatial channels, as opposed to many direction-specific channels. These two channels occupy
the left and right auditory hemifields, respectively, with each extending 30° across the median
plane.


Figure 5. Auditory pathways in the central nervous system. LE - left
ear, RE - right ear, AN - auditory nerve, CN - cochlear
nucleus, TB - trapezoid body, SOC - superior olivary
complex, LL - lateral lemniscus, IC - inferior colliculus.
Adapted from Aharonson and Furst (2001).

The  auditory  fibers  leaving  the  left  and  right  inner  ear  connect  directly  to  the  synaptic  inputs  of 
the  cochlear  nucleus  (CN)  on  the  same  (ipsilateral)  side  of  the  brainstem.  The  CN  contains  a 
mass  of  nerve  cell  bodies  on  which  nerve  fibers  form  connections  and  is  made  of  two  smaller 
nuclei:  the  dorsal  cochlear  nucleus  (DCN)  and  the  ventral  cochlear  nucleus  (VCN).  These  are 
formed  by  type  IV  cells  and  bushy  cells,  respectively  (e.g.,  Shofner  and  Young,  1985).  These 
types  of  cells  are  sensitive  to  changes  in  sound  intensity,  frequency,  and  onset  and  offset  as  well 
as  to  the  notches  in  the  spectral  content  of  the  sound  and  make  up  the  initial  stage  of  neural 
processing of auditory stimuli (e.g., Hancock and Voigt, 1999; Imig et al., 2000). The bushy
cells of the VCN connect to the ipsilateral superior olivary complex (SOC), which is the next
processing stage in the auditory pathway. The type IV cells of the DCN bypass the SOC and
connect directly to the ipsilateral inferior colliculus (IC) located higher in the processing chain
(Davis et al., 2003).


The  SOC  is  the  lowest  level  point  in  the  brainstem  where  the  neural  fibers  conveying  the 
auditory  signals  from  left  and  right  ear  decussate  (cross  from  one  side  of  the  nervous  system  to 
another) and is the principal site of binaural convergence (King et al., 2001). The SOC receives
inputs from both the ipsilateral and contralateral CNs and generates neural signals conveying
information about the location of the sound source in the horizontal plane. The SOCs consist of
four nuclei, but only the medial superior olivary (MSO) and lateral superior olivary (LSO) nuclei
receive inputs from both ears. The inputs from the contralateral ear are passed through the
trapezoid body (TB), which serves as a switch, changing the excitatory signal into an inhibitory
signal (Fitzpatrick et al., 2002). The signal switching and processing at the TB level is critical
for normal directional hearing. A number of animal studies have demonstrated that the
interruption of the neural pathways passing through the TB severely limits an animal’s ability to
localize sound (e.g., Masterton et al., 1967; Moore et al., 1974).

There  are  two  SOCs  (the  left  and  the  right)  in  the  brainstem,  and  most  of  the  innervations 
arriving  from  one  ear  terminate  at  the  ipsilateral  LSO  and  contralateral  MSO  nuclei.  The  MSO 
and LSO nuclei are mostly composed of two-input excitatory-excitatory (EE) and
excitatory-inhibitory (EI) neurons that operate as coincidence and difference detectors (Goldberg and
Brown, 1969; Emanuel and Letowski, 2009). These cells are sensitive to binaural differences
and perform initial coding of ITDs (mostly in the MSO; Masterton and Diamond, 1967; Brand et
al., 2002) and IIDs (mostly in the LSO; Boudreau and Tsuchitani, 1968; Guinan et al., 1972;
Irvine et al., 2001; Park, 1998; Yue and Johnson, 1997). Thus, it appears that the MSO and LSO
serve as binaural time difference and spectral difference analyzers, respectively (Gatehouse,
1982, p. 11). The ITD is encoded by phase/time locking and the IID by a spike rate (Zupanc,
2004).
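
The two binaural quantities discussed above can be illustrated with a simple signal-processing analogy. The Python sketch below is not a neural model and is not drawn from the cited studies: it estimates an ITD as the lag that maximizes the cross-correlation of the two ear signals (loosely analogous to coincidence detection) and an IID as the difference in RMS level (loosely analogous to difference detection). The function names, the sign conventions, and the choice of RMS level are assumptions made for illustration.

```python
import math


def estimate_itd(left, right, sample_rate_hz, max_lag_samples):
    """Estimate the interaural time difference in seconds.

    Returns the lag of `right` relative to `left` that maximizes their
    cross-correlation. A positive value means the right-ear signal is
    delayed, i.e., the sound reached the left ear first (assumed convention).
    """
    n = min(len(left), len(right))
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag_samples, max_lag_samples + 1):
        score = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                score += left[i] * right[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag / sample_rate_hz


def estimate_iid_db(left, right):
    """Estimate the interaural intensity difference in dB
    (left level minus right level), based on RMS amplitude."""
    rms_left = math.sqrt(sum(x * x for x in left) / len(left))
    rms_right = math.sqrt(sum(x * x for x in right) / len(right))
    if rms_left == 0.0 or rms_right == 0.0:
        raise ValueError("both signals must contain non-zero samples")
    return 20.0 * math.log10(rms_left / rms_right)
```

For instance, a unit click at sample 40 in the left channel and a half-amplitude click at sample 45 in the right channel, sampled at 48 kHz, yield a lag of 5 samples (about 0.104 ms) and a level difference of about 6 dB in favor of the left ear.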

The  projections  from  the  CN  and  SOC  on  each  side  of  the  brainstem  to  the  ipsilateral  IC  (see 
figure 5) form the corresponding lateral lemniscus (LL), which is the largest fiber tract in the
auditory brainstem. The ICs are where the temporal and spectral pattern information processed
in the cochlear nucleus is integrated with the binaural ITD and IID information arriving from the
SOCs. At the LL/IC level, the auditory pathways re-cross, providing additional coding of
binaural information. The importance of this neural bridge, known as the commissure of Probst,
can  be  demonstrated,  for  example,  by  severing  it,  which  results  in  a  marked  decrease  in 
localization  performance  in  the  midline  plane  (Itoh  et  al.,  1996). 

The  ICs  can  be  considered  as  the  central  stage  of  binaural  processing  in  the  brainstem,  because 
all the individual pathways from the CNs, LSOs, MSOs, and LLs terminate at the ICs (Batra and
Fitzpatrick, 2002; Casseday and Covey, 1987). Most notable is the further processing of IIDs at
the IC level. While LSO processing is sensitive to small IIDs, IC processing is biased toward
more global differences (Litovsky et al., 2002; Park, 1998). There are reports indicating that
brain lesions at the SOC level can cause small interaural differences to be perceived as larger
ones and that brain lesions at the IC level have the opposite perceptual effect (e.g., Aharonson et
al., 1998; Furst et al., 1995; Kavanagh and Kelly, 1992). According to Aharonson and Furst
(2001,  p.  2850)  the  SOC  “seeks  dissimilarity”  and  the  IC  “seeks  similarity”  between  the  left  and 
right  inputs.  It  has  also  been  hypothesized  that  the  IC  is  responsible  for  the  existence  of  the 
precedence  effect  (e.g.,  Yin,  1994)  and  may  facilitate  both  vertical  localization  and  the 
perception  of  sound  echoes. 

The  auditory  information  integrated  in  the  IC  is  further  processed  by  the  superior  colliculus  (SC) 
(Oliver and Huerta, 1992; King et al., 2001). Data reported by Middlebrooks and Knudsen
(1984)  and  King  and  Hutchings  (1987)  indicate  that  the  topographic  representation  of  auditory 
space  is  already  developed  at  the  SC  level  before  being  remapped  at  the  cortical  levels.  The  final 
stage  of  auditory  information  processing  in  the  brainstem  is  the  medial  geniculate  body  (MGB) 
of  the  thalamus,  which  is  the  entry  point  of  the  auditory  information  to  the  brain  (Starr  and  Don, 
1972;  Winer,  1992).  From  here  the  signals  are  projected  to  the  auditory  cortex,  one  on  each  side 
of  the  brain,  and  recoded  to  form  a  spatiotemporal  distribution  of  activity  within  the  brain 
(Hackett, 2011; King et al., 2001). According to Palomaki et al. (2000), spatial stimuli elicit
predominantly contralateral activity in the auditory cortex, and the combined spatial information
is processed in the right hemisphere of the brain.

The  high-level  bridge  between  the  left  and  right  parts  of  the  nervous  system  is  the  corpus 
callosum  (CC),  which  connects  the  left  and  right  parts  of  the  brain.  This  late  bridge  also 
contributes  to  overall  binaural  auditory  localization  ability  (Musiek  and  Weihing,  2011). 
However,  its  disruption  is  not  as  detrimental  to  directional  hearing  as  the  disruption  of  the  TB  or 
IC bridges. For example, Lessard et al. (2002) demonstrated that people with callosal agenesis
and  early  callosotomy  had  greater  difficulties  with  the  binaural  localization  of  moving  sound 
sources  than  listeners  in  the  control  group.  However,  some  of  the  test  listeners  outperformed  the 
listeners  of  the  control  group  in  localizing  stationary  sound  sources.  This  was  interpreted  as 
indicating  that  the  absence  of  the  CC  caused  some  subjects  to  make  more  efficient  use  of 
monaural  cues.  The  efficiency  of  information  integration  over  the  CC  has  been  reported  to 
decline to some degree with age (especially during the 40-55 age period) (Bellis and Wilber,
2001),  and  this  may  partially  explain  the  observed  age-related  decline  in  localization  ability 
discussed  in  section  2.6.2. 

The  coding  of  spatial  information  at  the  SOC  and  LL  levels  and  the  decoding  of  this  information 
in  the  auditory  cortex  of  the  brain  are  the  three  main  neural  processes  forming  our  auditory 
spatial perception (Masterton et al., 1967; Møller, 2000). Recent neuroimaging studies have
shown  that  the  processing  of  auditory  spatial  information  takes  place  in  a  distributed  network  of 
brain  areas  including  the  superior,  middle,  and  inferior  frontal  gyri  and  the  posterior  and  inferior 
parietal and middle temporal cortices (Bushara et al., 1999; Kaiser and Bertrand, 2003; Maeder et
al., 2001; Martinkauppi et al., 2000; Weeks et al., 1999). The spike patterns (spike counts and
spike timing) of the auditory cortical neurons carry integrated information about sound source
location in both the horizontal and vertical dimensions (Middlebrooks et al., 1994; 1998; Pickles,
2003; Xu et al., 1998; 2000), and some neurons are especially sensitive to this type of
information (Middlebrooks and Pettigrew, 1981).

Several authors (e.g., Las et al., 2008; Meredith and Clemo, 1989; Santamaria et al., 2009) have
reported  that  the  cortical  region  of  the  anterior  ectosylvian  sulcus  (AES)  plays  an  important  role 
in  facilitating  SC  response  to  auditory  and  visual  stimuli.  Korte  and  Rauschecker  (1993)  and 
Rauschecker  and  Korte  (1993)  observed  that  after  suturing  kittens’  eyes  shut,  the  kittens 
developed  a  smaller  visual  area  and  larger  auditory  area  in  the  AES.  The  authors  hypothesized 
this may also apply to blind people, who depend greatly on auditory cues (Doucet et al., 2005).

Numerous  studies  have  indicated  that  the  mapping  of  auditory  neural  responses  to  relative 
coordinates  in  space  is  a  learned  process  and  that,  if  needed,  this  mapping  is  developed  and 
modified over time (e.g., King et al., 2001). It is also a probabilistic process in which the
majority  of  the  neural  responses  determines  the  final  mapping.  Moreover,  it  is  not  an  isolated 
process,  but  rather  a  synergic  one,  in  which  spatial  auditory  information  is  moderated  by  other 
sensory  inputs  to  the  brain,  including  the  sense  of  balance  and  various  higher  order  brain 
processing  centers. 


4.  Terminology,  Notation,  and  Conventions 


Depending  on  the  task  given  to  the  listener,  there  are  two  basic  types  of  localization  judgments: 

•  Relative  localization  (discrimination  task) 

•  Absolute  localization  (identification  task) 

Relative  localization  judgments  are  made  when  one  sound  source  location  is  compared  to 
another,  either  simultaneously  or  sequentially.  These  judgments  are  made  to  determine  spatial 
resolution  of  events  in  a  given  environment  or  assess  the  listener’s  ability  to  discriminate  sounds. 
Absolute  localization  judgments  involve  only  one  sound  source  location  that  needs  to  be 
identified.  In  addition,  absolute  localization  judgments  can  be  made  on  a  continuous  circular 
scale  and  expressed  in  degrees  (°)  or  can  be  restricted  to  a  limited  set  of  preselected  directions. 
The  latter  type  of  judgment  occurs  when  all  the  potential  sound  source  locations  are  marked  by 
labels  (e.g.,  number),  and  the  listener  is  asked  to  identify  the  sound  source  location  by  label.  The 
actual  sound  sources  may  or  may  not  be  visible.  This  type  of  localization  judgment  is  referred  to 
throughout  this  report  as  categorical  localization. 

From  the  listener’s  perspective,  the  most  complex  and  demanding  judgments  are  absolute 
localization  judgments,  and  they  are  the  main  subject  of  this  report.  The  other  two  types  of 
judgments,  discrimination  judgments  and  categorization  judgments,  are  only  briefly  described 
and  compared  to  absolute  judgments  later  in  the  report. 

In order to assess human sound source localization ability, the physical reference space needs to
be  defined  in  relation  to  the  position  of  the  human  head.  This  reference  space  can  be  described 
either  in  the  rectangular  or  polar  coordinate  system.  The  rectangular  coordinate  system  x,  y,  z  is 
the  basis  of  Euclidean  geometry  and  is  also  called  the  Cartesian  coordinate  system.  In  the 
human-body-oriented  Cartesian  coordinate  system  the  x,  y,  and  z  axes  are  typically  oriented  as 
left-right (west-east), back-front (south-north), and down-up (nadir-zenith), respectively^'^. The
right,  front,  and  up  directions  indicate  the  positive  ends  of  the  scales.  The  Euclidean  planes 
associated  with  the  Cartesian  coordinate  system  are  the  vertical  lateral  (x-z),  the  vertical  sagittal 
(y-z),  and  the  horizontal  (x-y)  planes. 

In  reference  to  the  anatomy  of  the  human  body,  the  relative  orientations  of  the  Euclidean  planes 
are  shown  in  figure  6.  A  sagittal  plane  is  a  vertical  plane  that  runs  from  front  to  back  dividing 
the  body  into  right  and  left  sections.  A  lateral  plane  is  a  vertical  plane  that  passes  from  left  to 
right  and  divides  the  body  into  front  and  back  sections.  A  horizontal  plane  is  a  plane 
perpendicular  to  the  sagittal  and  lateral  planes  and  divides  the  human  body  into  superior  (upper) 
and  inferior  (lower)  sections. 

The  following  are  the  main  reference  planes  of  symmetry  of  the  human  body: 

•  Median  sagittal  (midsagittal)  plane:  y-z  plane 

•  Frontal (or coronal) lateral plane: x-z plane

•  Axial  (transversal,  transaxial)  horizontal  plane:  x-y  plane 

The  median  (midsagittal)  plane  is  the  sagittal  plane  (figure  6)  that  is  equidistant  from  both  ears. 
The virtual line passing through both ears is called the interaural axis. The ear closer to the
sound  source  is  termed  the  ipsilateral  ear  and  the  ear  farther  away  from  the  sound  source  the 
contralateral  ear.  The  frontal  (coronal)  plane  is  the  lateral  plane  that  divides  the  listener’s  head 
into  front  and  back  hemispheres  along  the  interaural  axis.  The  axial  (transversal)  plane  is  the 
main horizontal plane of symmetry of the human body, passing through the waist. In the
head-centered frame of reference, the axial plane is replaced by the horizontal plane passing through
the interaural axis, which is referred to as the interaural plane^^. Any references in this report to
the  horizontal  plane  refer  to  the  interaural  plane. 


^'^In  some  publications  the  x-axis  is  oriented  as  front-back  and  the  y  axis  as  right-left  (e.g.,  Gerzon,  1992). 
^^Knudsen  (1982)  refers  to  the  interaural  plane  as  the  visuoaural  plane. 

Figure 6. The main reference planes of the
human  body. 

The polar coordinate system can be used both in Euclidean geometry and in the spherical,
non-Euclidean geometry that is useful in describing relations between points on a closed surface
such as a sphere. In the polar system of coordinates, the reference dimensions are d (distance or
radius), θ (declination or azimuth), and φ (elevation). Distance is the amount of linear separation
between two points in space, usually between the observation point and the target. The angle of
declination (azimuth) is the horizontal angle between the median plane and the line connecting
the  point  of  observation  to  the  target.  The  angle  of  elevation  is  the  vertical  angle  between  the 
interaural  plane  and  the  line  from  the  point  of  observation  to  the  target.  The  Cartesian  and  polar 
systems  are  shown  together  in  figure  7. 


Figure  7.  Commonly  used  symbols  and  names  in 
describing  spatial  coordinates. 
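
The relation between the two systems can be written out as a short conversion routine. The sketch below is only an illustration: it assumes the axis orientation given above (x to the right, y to the front, z up), azimuth measured from the front direction and positive toward the right, and elevation measured upward from the interaural plane. The function names are hypothetical.

```python
import math

def polar_to_cartesian(d, azimuth_deg, elevation_deg):
    """Convert (distance, azimuth, elevation) to head-centered (x, y, z)."""
    theta = math.radians(azimuth_deg)
    phi = math.radians(elevation_deg)
    x = d * math.cos(phi) * math.sin(theta)  # positive to the right
    y = d * math.cos(phi) * math.cos(theta)  # positive to the front
    z = d * math.sin(phi)                    # positive upward
    return x, y, z

def cartesian_to_polar(x, y, z):
    """Convert head-centered (x, y, z) to (distance, azimuth, elevation) in degrees."""
    d = math.sqrt(x * x + y * y + z * z)
    if d == 0.0:
        raise ValueError("direction is undefined at the origin")
    azimuth = math.degrees(math.atan2(x, y))  # 0 deg = front, +90 deg = right
    # Clamp the ratio to guard against rounding slightly outside [-1, 1].
    elevation = math.degrees(math.asin(max(-1.0, min(1.0, z / d))))
    return d, azimuth, elevation
```

With these assumptions, a source at azimuth 90° and elevation 0° lies on the positive x axis, and converting a point to Cartesian form and back returns the original distance, azimuth, and elevation.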


Although  the  polar  coordinate  system  based  on  distance,  azimuth,  and  elevation  is  almost 
universally  used  across  the  world,  and  particularly  in  localization  studies,  it  is  not  the  only 
system used. Thus, to differentiate it from other polar systems, it is frequently referred to as the
vertical-polar  coordinate  system. 

A  characteristic  property  of  the  vertical-polar  coordinate  system  is  that  the  length  of  an  arc 
between  two  angles  of  azimuth  depends  on  elevation.  As  a  result,  the  separation  distance 
between  N  points  evenly  distributed  at  a  fixed  elevation  differs  depending  on  the  elevation.  The 
points  will  be  closer  together  near  the  poles  and  farther  apart  near  the  equator.  This  means  that 
the  points  will  not  be  uniformly  distributed  over  the  whole  sphere.  A  uniform  distribution  is 
desirable,  for  example,  in  placing  the  loudspeakers  in  3-D  localization  studies  and  selecting  the 
most  efficient  number  of  test  points  in  HRTF  measurements.  In  such  cases,  the  separation  of 
points  by  constant  angle  of  azimuth  is  not  an  effective  solution. 

Another  polar  coordinate  system  that  is  used  in  auditory  localization  studies  is  the  interaural- 
polar  coordinate  system  based  on  distance,  lateral  angle,  and  rising  angle  coordinates 
(Morimoto, 2001; Morimoto and Aokata, 1984; Morimoto et al., 1983; 2002). The concept of
the  interaural-polar  coordinate  system,  also  referred  to  as  horizontal-polar  coordinate  system 
(Macpherson  and  Middlebrooks,  2002),  is  shown  in  figure  8. 


Figure 8. The interaural-polar coordinate system. S - sound
source, O - center of the listener’s head, d - distance
between the sound source and the center of the
listener’s head; θ - azimuth angle; φ - elevation
angle; α - lateral angle; and β - rising angle. Adapted
from Morimoto and Aokata (1984).

The lateral angle α is the angle between the interaural axis and the direction toward the sound
source. The concept of such an angle and the name lateral angle was originally introduced by
Wallach (1940) in his study of the role of head movement in localization. The rising angle β is
the angle between the horizontal plane and the plane passing through the interaural axis and the
location of the sound source. The lateral angle is frequently referred to as the binaural disparity
cue and the rising angle as the spectral cue (Morimoto and Nomachi, 1982; Morimoto and
Aokata, 1984).
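
The two definitions above translate directly into a conversion from azimuth and elevation to lateral and rising angles. The sketch below is an illustration under stated assumptions, not a reproduction of any published routine: the lateral angle is measured from the right-hand end of the interaural axis (0° directly to the right, 90° on the median plane, 180° directly to the left), azimuth is positive toward the right, and the rising angle is 0° in front, 90° overhead, and 180° behind. The function name is hypothetical.

```python
import math

def vertical_to_interaural(azimuth_deg, elevation_deg):
    """Convert vertical-polar (azimuth, elevation) to interaural-polar (lateral, rising)."""
    theta = math.radians(azimuth_deg)
    phi = math.radians(elevation_deg)
    # Unit direction vector: x to the right, y to the front, z up.
    x = math.cos(phi) * math.sin(theta)
    y = math.cos(phi) * math.cos(theta)
    z = math.sin(phi)
    # Lateral angle: angle between the interaural (x) axis and the source direction.
    lateral = math.degrees(math.acos(max(-1.0, min(1.0, x))))
    # Rising angle: rotation about the interaural axis, measured up from the front.
    # It carries no information for a source lying on the interaural axis itself.
    rising = math.degrees(math.atan2(z, y))
    return lateral, rising
```

Under these assumptions, a source straight ahead maps to a lateral angle of 90° and a rising angle of 0°, and a source directly behind the listener maps to a lateral angle of 90° and a rising angle of 180°.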

The  main  advantage  of  the  interaural-polar  coordinate  system  over  the  vertical-polar  coordinate 
system  is  that  length  of  the  arc  (on  the  surface  of  the  sphere)  between  two  lateral  angles  is 
independent of elevation. However, the interaural-polar coordinate system does not differ from a
vertical-polar coordinate system in which the x and z axes have been exchanged. Thus, the
length of the arc between two rising angles depends on the lateral angle, which again leads to a
non-uniform  distribution  of  points  on  the  surface  of  the  sphere.  The  difference  between  both 
systems  can  be  seen  in  figure  9. 


Figure  9.  Vertical-polar,  interaural-polar,  and  two-pole  coordinate  systems  (from  left  to  right).  Adapted 
from  Carlile  (1996). 

The vertical-polar and interaural-polar systems are both single-pole coordinate systems, and the
interaural-polar system may just as well be called the horizontal-polar coordinate system.
However,  despite  their  similar  limitations,  the  interaural-polar  system  does  have  an  advantage 
over  the  vertical-polar  system  in  localization  studies  since  localization  resolution  in  the  vertical 
plane  is  much  poorer  than  in  the  horizontal  plane. 

The  polar  coordinate  system  that  results  in  points  that  are  equally  separated  by  angle  of  azimuth 
being  uniformly  distributed  over  the  whole  sphere  is  the  two-pole  coordinate  system  shown  in 
the rightmost panel of figure 9. In the two-pole system, both longitudes and latitudes are
represented  by  a  series  of  parallel  circles.  Though  less  intuitive,  this  system  may  be  convenient 
for  some  types  of  data  presentation,  e.g.,  for  comparing  arbitrary  angles  and  in  HRTF  studies 
(Knudsen,  1982;  Makous  and  Middlebrooks,  1990).  However,  as  all  three  of  the  systems  shown 
in  figure  9  usually  share  the  same  concepts  of  azimuth  and  elevation,  it  is  essential  that  the 
specific  spherical  coordinate  system  being  used  in  a  study  always  be  explicitly  stated  (Leong  and 
Carlile,  1998). 

It  should  also  be  noted  that  regardless  of  the  polar  coordinate  system  selected,  there  are  two 
conventions  for  numerically  labeling  angular  degrees  that  are  used  in  the  scientific  literature:  the 
360°  scheme  and  the  ±180°  scheme.  There  are  also  two  possibilities  for  selecting  the  direction  of 
positive  angular  change:  clockwise  (e.g.,  Tonning,  1970)  and  counterclockwise  (e.g.,  Pedersen 
and  Jorgensen,  2005). 

The  use  of  two  notational  schemes  is  primarily  a  nuisance  that  necessitates  data  conversion  in 
order  to  compare  or  combine  data  sets  labeled  with  different  schemes.  However,  converting 
angles that are expressed differently in the two schemes from one scheme to the other is just a
matter  of  either  adding  or  subtracting  360°. 
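
A minimal sketch of this conversion is given below. The function names are hypothetical, and the choice to report the rear direction as +180° rather than -180° in the signed scheme is an assumption made here for definiteness.

```python
def to_signed_180(angle_deg):
    """Express an angle in the signed scheme, in the half-open range (-180, 180]."""
    wrapped = angle_deg % 360.0          # Python's % yields a value in [0, 360)
    return wrapped - 360.0 if wrapped > 180.0 else wrapped

def to_360(angle_deg):
    """Express an angle in the 360-degree scheme, in the range [0, 360)."""
    return angle_deg % 360.0

print(to_signed_180(330.0))  # -30.0
print(to_360(-30.0))         # 330.0
```

Angles between 0° and 180° are written identically in both schemes, so only the angles in the other half of the circle are shifted by 360°.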

In  the  case  of  localization  studies,  in  which  both  average  angles  and  differences  between  angles 
are calculated, the ±180° labeling scheme is overwhelmingly preferred. First, it is much simpler
and more intuitive to use positive and negative angles to describe angular difference. Second,
the direct summing and averaging of angular values can only be done with angles that are
contained within a (numerically) continuous range of 180°, such as ±90°, which is very often the
whole range of interest. If the 360° scheme is used, or if the continuous range of 180° in a ±180°
labeling  scheme  is  exceeded,  then  angles  to  the  left  and  right  of  0°  (the  reference  angle)  cannot 
be  directly  added  and  must  be  converted  into  vectors  and  added  using  vector  addition.  In  the 
case  of  angular  differences,  the  angle  of  360°  must  be  added  to  or  subtracted  from  (depending  on 
the  notation  scheme)  the  differential  angle. 
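
The vector-addition approach to averaging, and the wrap-around correction for angular differences, can be sketched as follows. This is an illustration only: each angle is treated as a unit vector, the mean direction is the direction of the vector sum, and the function names are hypothetical.

```python
import math

def circular_mean_deg(angles_deg):
    """Mean direction obtained by adding unit vectors; result in (-180, 180]."""
    s = sum(math.sin(math.radians(a)) for a in angles_deg)
    c = sum(math.cos(math.radians(a)) for a in angles_deg)
    if math.hypot(s, c) < 1e-12:
        raise ValueError("mean direction is undefined: the unit vectors cancel out")
    return math.degrees(math.atan2(s, c))

def angular_difference_deg(a_deg, b_deg):
    """Signed difference a - b, wrapped into (-180, 180]."""
    diff = (a_deg - b_deg) % 360.0
    return diff - 360.0 if diff > 180.0 else diff

# Two judgments on either side of the 0-degree reference, labeled in the 360-degree scheme.
judgments = [350.0, 10.0]
print(sum(judgments) / len(judgments))      # arithmetic mean: 180.0, pointing backward
print(circular_mean_deg(judgments))         # approximately 0, pointing forward
print(angular_difference_deg(10.0, 350.0))  # 20.0
```

The arithmetic mean of 350° and 10° points to the rear, whereas the vector-based mean points to the front, which is the intended result for two judgments that straddle the reference direction.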

Less clear is the selection of the positive and negative directions of angular difference.
However, if the ±180° scheme is used, the absolute magnitude of angular values is the same
regardless of directionality, which is another reason to prefer the ±180° scheme. Under the 360°
scheme,  the  clockwise  measurement  of  any  angle  other  than  180°  will  have  a  different  magnitude 
than  that  same  angle  measured  counterclockwise,  i.e.,  30°  in  the  clockwise  direction  is  330°  in 
the  counterclockwise  direction. 

In  mathematics  (e.g.,  geometry)  and  physics  (e.g.,  astronomy),  a  displacement  in  a 
counterclockwise  direction  is  considered  positive,  and  a  displacement  in  a  clockwise  direction  is 
considered  negative.  In  geometry,  the  quadrants  of  the  circle  are  ordered  in  a  counterclockwise 
direction,  and  an  angle  is  considered  positive  if  it  extends  from  the  x-axis  in  a  counterclockwise 
direction.  In  astronomy,  all  the  planets  of  our  solar  system,  when  observed  from  “above”  the 
Sun,  rotate  and  revolve  around  the  Sun  in  a  counterclockwise  direction  (except  for  the  rotation  of 
Venus). 

However,  despite  the  scientific  basis  of  the  counterclockwise  rule,  the  numbers  on  clocks  and  all 
other  circular  measuring  scales,  including  the  compass,  increase  in  a  clockwise  direction, 
effectively  making  it  the  positive  direction.  This  convention  is  shown  in  figure  7  and  is  used 
throughout  this  report.  For  locations  that  differ  in  elevation,  the  upward  direction  from  a  0° 
reference  point  in  front  of  the  listener  is  normally  considered  as  the  positive  direction,  and  the 
downward  direction  is  considered  to  be  the  negative  direction. 

The  last  potential  difficulty  in  using  angular  scales  is  the  overlap  between  horizontal  and  vertical 
angular information. To avoid confusion resulting from the simultaneous use of ±180°
horizontal and vertical coordinates in the vertical-polar coordinate system, the azimuth is
specified in the ±180° range and elevation in the ±90° range. The opposite is true for the
interaural-polar coordinate system, resulting in front/back directions expressed as 0° (front) and
±180° (back) elevation angles. In the two-pole coordinate system, either of these two
conventions  can  be  used  but  must  be  clearly  specified. 


5.  Accuracy and Precision of Auditory Localization


Human  judgment  of  sound  source  location  is  a  noisy  process  laden  with  judgment  uncertainty, 
which  leads  to  localization  errors.  Auditory  localization  error  (LE)  is  the  difference  between  the 
estimated  and  actual  directions  toward  the  sound  source  in  space.  This  difference  can  be  limited 
to  difference  in  azimuth  or  elevation  or  can  include  both  (e.g.,  Wightman  and  Kistler,  1989b; 
Carlile et al., 1997). The latter can be referred to as compound LE [or dual-plane error (Laroche,
1994)] and considered as a geometric sum of the horizontal and vertical LEs (e.g., Grantham et
al., 2003):

Sound localization error (LE) is the difference between the estimated and the actual
direction toward the sound source in space.

Once  the  localization  act  is  repeated  several  times,  LE  becomes  a  statistical  variable.  The 
statistical  properties  of  this  variable  are  generally  described  by  spherical  statistics  due  to  the 
spherical/circular nature of angular values (θ = θ + 360°). However, if the range of the angular
judgments is limited to a ±90° range, the data distribution can be assumed to have a linear
character, which greatly simplifies data analysis (a discussion of statistical analyses is found in
sections 6 and 7). Such a situation is typical in the case of auditory localization judgments, where
the vast majority of LEs are either local errors or reversal errors. Local errors, or genuine
errors (Eyring, 1945), are errors within ±45° of the mean, and in practice these will usually stay
within ±20°/±25° (Carlile et al., 1997; Scharine, 2009). Reversal errors, also called confusion
errors (Eyring, 1945), can be either front-back or back-front type errors that are larger than ±90°
and usually close to ±180° (Carlile et al., 1997; Scharine, 2005). These errors are a special class
of LEs and should be extracted from the whole data set and analyzed separately in order to avoid
getting  an  erroneously  large  mean  localization  error  (e.g.,  Bergault,  1991;  Carlile  et  al.,  1997; 
Makous  and  Middlebrooks,  1990;  Oldfield  and  Parker,  1984a).  A  more  thorough  discussion  of 
reversal  errors,  their  effects  on  mean  localization  errors,  and  the  methods  for  accounting  for  them 
is  offered  in  section  10.  In  general,  regardless  of  whether  the  reversal  errors  are  pre-processed  or 
not,  the  joint  analysis  of  all  types  of  errors  should  only  be  done  under  specific  circumstances  and 
with  great  caution.  This  analysis,  if  performed,  should  always  accompany  the  separate  analyses 
of  both  types  of  errors,  since  on  its  own  such  an  analysis  may  lead  to  erroneous  conclusions.  The 
metrics  of  linear  statistics  commonly  used  to  describe  the  results  of  localization  studies  are 
discussed  in  section  6.  The  methods  of  spherical  (circular)  statistical  data  analysis  are  discussed 
in  section  7. 
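
The separation of local errors from reversal errors can be sketched as a simple thresholding step. The example below is a deliberately simplified illustration and is not the procedure of any cited study: it assumes that the errors are signed values already expressed in the ±180° scheme, it applies the 45° and 90° limits quoted above to the error magnitude, and it places anything between the two limits in a separate group for inspection. The function name and the third group are hypothetical.

```python
def split_errors(errors_deg, local_limit=45.0, reversal_limit=90.0):
    """Split signed localization errors into local, reversal, and unclassified groups."""
    local, reversal, other = [], [], []
    for error in errors_deg:
        magnitude = abs(error)
        if magnitude <= local_limit:
            local.append(error)        # candidate local (genuine) error
        elif magnitude > reversal_limit:
            reversal.append(error)     # candidate front-back or back-front reversal
        else:
            other.append(error)        # between the limits: inspect separately
    return local, reversal, other

local, reversal, other = split_errors([5.0, -12.0, 170.0, -175.0, 60.0])
print(local)     # [5.0, -12.0]
print(reversal)  # [170.0, -175.0]
print(other)     # [60.0]
```

Computing the mean and standard deviation on the local group alone avoids the inflated values that a few errors near ±180° would otherwise produce.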

The  probability  distribution  used  to  describe  localization  judgments,  and  in  fact  most  human 
judgment  phenomena,  is  the  normal  distribution,  also  known  as  the  Gaussian  distribution.  It  is  a 
purely  theoretical  distribution,  but  it  approximates  the  distribution  of  human  errors  well  and  is 
thus commonly used in experiments with human subjects. In the case of localization judgments,
this distribution reflects the random variability of the localizations while emphasizing the
tendency  of  the  localizations  to  be  centered  on  some  direction  (ideally,  the  true  sound  source 
direction)  and  to  become  (symmetrically)  less  likely  the  further  away  we  move  from  that  central 
direction. 

The  normal  distribution  has  the  shape  of  a  bell  and  is  completely  described  in  its  ideal  form  by 
two parameters: the mean (μ) and the standard deviation (σ). The mean corresponds to the
central value around which the distribution extends, and the standard deviation describes the
range of variation. In particular, ~2/3 of the values (68.2%) will be within one standard deviation
from the mean, i.e., within the range [μ − σ, μ + σ]. The mathematical form and graph of the
normal  distribution  are  shown  in  figure  10. 


Figure 10. Normal distribution. Standard deviation (σ) is the
range of variability around the mean value (μ ± σ)
that accounts for ~2/3 of all responses.
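
The "~2/3 within one standard deviation" property can be checked numerically. The sketch below uses the standard textbook form of the normal probability density and the error function from the Python standard library. The function names are hypothetical.

```python
import math

def normal_pdf(x, mu, sigma):
    """Probability density of the normal distribution with mean mu and SD sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def fraction_within(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2.0))

print(round(fraction_within(1.0), 4))  # 0.6827
print(round(fraction_within(2.0), 4))  # 0.9545
```

The result does not depend on the particular values of the mean and standard deviation, which is why the same proportion applies to any set of judgments that is well described by a normal distribution.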

Based  on  the  above  discussion,  each  set  of  localization  judgments  can  be  described  by  a  normal 
distribution  with  a  specific  mean  and  standard  deviation.  Ideally,  the  mean  of  the  distribution 
should  correspond  with  the  true  sound  source  direction.  However,  any  lack  of  symmetry  in 
listener  hearing  or  in  the  listening  conditions  may  result  in  a  certain  bias  in  listener  responses  and 
cause  a  misalignment  between  the  perceived  location  of  the  sound  source  and  its  actual  location. 
Such  bias  is  called  constant  error  (CE).  CE  depends  mainly  on  the  symmetry  of  the  auditory 
system of the listener, the type and behavior of the sound source (e.g., auditory motion
aftereffect^^; Grantham, 1989; Grantham and Wightman, 1979; Jones and Bunting, 1949;
Kagerer and Contreras-Vidal, 2009; Recanzone, 1998), and the acoustic conditions of the
surrounding  space.  It  also  depends  on  the  familiarity  of  the  listener  with  the  listening  conditions 
and  on  some  non-acoustic  factors,  such  as  uneven  lighting  in  the  room.  Some  potential  bias  may 
also  be  introduced  by  the  reported  human  tendency  to  misperceive  the  midpoint  of  the  angular 
distance  between  two  horizontally  distinct  sound  sources.  Several  authors  have  reported  the 
midpoint to be located 1° to 2° rightward (Cusak et al., 2001; Dufour et al., 2007; Sosa et al.,
2010), although this shift may be modulated by listener handedness. For example, Ocklenburg et
al. (2010) observed a rightward shift in sound source localization for left-handed listeners and a
leftward shift for right-handed listeners (see appendix A).


^^Auditory motion aftereffect (AMA) is a sensation caused by long-term or repeated listening to a sound source moving in
one specific direction. After such exposure, a stationary sound source is perceived as moving in the direction opposite to the
movement of the previous sound source. According to Neelon and Jenison (2004), AMA is not symmetrical and is much
stronger when the adaptor (preceding sound) moves toward the listener’s midline. The existence of AMA is considered to be
evidence of channel-based spatial coding of information (Hyams and Carlile, 1996).

Another  type  of  error  is  introduced  by  both  listener  uncertainty/imprecision  and  random  changes 
in  the  listening  conditions.  This  error  is  called  random  error  (RE).  The  size  of  RE  depends 
primarily  on  fluctuations  in  the  listener’s  attention,  differences  between  listeners,  and  the 
stability  and  clarity  of  the  signal  emitted  by  the  sound  source.  In  addition,  both  CE  and  RE 
depend  on  the  data  collection  methodology,  especially  the  form  of  the  listener’s  response  (e.g., 
direct  or  indirect  pointing,  verbal  identification,  etc.).  These  forms,  called  response  techniques, 
are  discussed  in  section  8. 


Therefore, LE can be considered as being composed of two error components with different
underlying causes: CE resulting from a bias in the listener and/or environment and RE resulting
from the inherent variability of listener perception and listening conditions. If LE is described by
a normal distribution, CE is given by the difference between the true sound source location (η)
and the mean of the distribution (μ), and RE is characterized by the standard deviation (σ) of the
distribution. In the case of experimental data, these values are estimated by the sample mean (x₀)
and sample standard deviation (SD).

The concepts of CE and RE can be equated, respectively, with the concepts of accuracy and
precision of a given set of measurements, although a variety of terms are used in the literature to
convey  these  meanings  (e.g.,  Middlebrooks,  1999ab).  The  definitions  of  both  these  terms,  along 
with  common  synonyms  (although  not  always  used  correctly),  are  given  below: 

Accuracy  (constant  error,  systematic  error,  validity,  bias)  is  the  measure  of  the  degree  to 
which  the  measured  quantity  is  the  same  as  its  actual  value. 

Precision (random error, repeatability, reliability, reproducibility, blur) is the measure of
the  degree  to  which  the  same  measurement  made  repeatedly  produces  the  same  results. 


The  relationship  between  accuracy  and  precision  and  the  normal  distribution  from  figure  10  are 
shown in figure 11.


Figure 11. Concepts of accuracy and precision in
localization judgments.

The terms “localization accuracy” and “precision” are normally used to characterize the type of
error,  while  the  terms  “CE”  and  “RE”  are  used  in  reference  to  the  size  and  value  of  the  error.  RE 
is  usually  expressed  as  the  SD  of  the  distribution  of  the  localization  judgments.  There  are  also 
several  other  metrics  that  can  be  used  to  assess  RE  (see  section  6),  but  SD  is  the  most  common 
RE  metric  used  in  the  literature.  CE  can  be  expressed  either  as  the  difference  between  the  mean 
perceived location (x₀) and the true position of the target (η), which is termed mean error (ME):

$$ ME = x_0 - \eta \qquad (4) $$


or as the mean error normalized by the SD of the responses

$$ \Delta = \frac{ME}{SD} \qquad (5) $$

where Δ is the relative ME. This second definition can be interpreted as the relative CE, that is,
the ratio of CE to RE, and is a useful metric of the relative contribution of both types of errors to
overall LE.


Overall LE, which is sometimes denoted in the literature by the letter D (Hartmann, 1983a;
Hartmann et al., 1998; Grantham, 1995; Rakerd and Hartmann, 1986), is given by the square root
of the sum of the squares of CE and RE:

$$ LE = \sqrt{CE^2 + RE^2} \qquad (6) $$


It can also be expressed in terms of the goodness-of-fit (GoF) criterion (Bolshev, 2002) as

$$ LGoF = \frac{1}{\sqrt{CE^2 + RE^2}} \qquad (7) $$

Localization goodness of fit (LGoF) is a convenient coefficient to capture the average deviation
of the actual localization judgments made over a range of angular sound source positions.

The main problem with using overall LE or LGoF metrics is that these metrics combine two
very different types of localization errors, which, when added together, are seldom indicative of
anything meaningful. Separate calculations and analyses of CE and RE are almost always more
useful for data interpretation and are preferred for data reporting. If, for some reason, the
overall LE is needed, it should always be reported together with the respective CE and RE
values. 
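
The metrics described in this section can be computed from a set of repeated judgments in a few lines. The sketch below is an illustration under stated assumptions: the judgments are signed angles in degrees that stay within a linear (±90°) range, reversal errors have already been removed, and SD is taken as the square root of the average squared deviation (divisor n), which is one of two common conventions. The function and key names are hypothetical.

```python
import math
import statistics

def localization_metrics(responses_deg, target_deg):
    """Constant error (ME), random error (SD), relative ME, and overall LE for one target."""
    x0 = statistics.fmean(responses_deg)     # sample mean of the judgments
    sd = statistics.pstdev(responses_deg)    # assumption: divisor n rather than n - 1
    me = x0 - target_deg                     # constant error (CE), expressed as ME
    return {
        "ME": me,
        "SD": sd,
        "relative_ME": me / sd if sd > 0.0 else math.nan,
        "LE": math.sqrt(me ** 2 + sd ** 2),  # overall error combining CE and RE
    }

# Five judgments of a source located at 30 degrees azimuth.
metrics = localization_metrics([28.0, 32.0, 30.0, 34.0, 36.0], 30.0)
print({name: round(value, 2) for name, value in metrics.items()})
# {'ME': 2.0, 'SD': 2.83, 'relative_ME': 0.71, 'LE': 3.46}
```

Returning ME and SD alongside the overall LE follows the recommendation above that the combined value should never be reported without its two components.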


All the localization error metrics discussed previously - D (overall LE), LGoF, ME, Δ, and SD -
can be used in assessing LE for one specific sound source location or across a range of angular
locations  by  simple  spatial  averaging,  although  only  the  use  of  RE  metrics  makes  practical  sense 
in  the  latter  case.  In  addition,  since  RE  is  dependent  on  the  angle  of  incidence  of  the  arriving 
sound,  averaging  across  several  sound  source  locations  makes  the  most  sense  when  two  or  more 
listening conditions, sound sources, or groups of listeners are being compared rather than as an
indicator  of  the  absolute  RE  for  one  specific  listening  situation. 

6.  Linear  Statistical  Measures 


The  two  fundamental  classes  of  measures  describing  probability  distributions  are  measures  of 
central  tendency  and  measures  of  dispersion.  Measures  of  central  tendency,  also  known  as 
measures  of  location,  characterize  the  central  value  of  a  distribution.  Measures  of  dispersion, 
also  known  as  measures  of  spread,  characterize  how  spread  out  the  distribution  is  around  its 
central  value.  In  general,  distributions  are  described  and  compared  on  the  basis  of  a  specific 
measure  of  central  tendency  in  conjunction  with  a  specific  measure  of  spread. 

6.1  Normal  (Gaussian)  Distribution 

For the normal distribution, the mean (μ), a measure of central tendency, and the standard
deviation (σ), a measure of dispersion, serve to completely describe (parameterize) the
distribution. There is, however, no practicable way of directly determining the true
values of these parameters for a normal distribution that has been postulated to characterize some
population of judgments, measurements, etc. Thus, these parameters must be estimated on the
basis of a representative sample taken from the population. The sample arithmetic mean (x₀) and
the sample standard deviation are the standard metrics used to estimate the mean and standard
deviation of the underlying normal distribution. The sample arithmetic mean represents the
center of gravity of all the numeric judgments. The sample standard deviation, introduced by
Pearson (1894) to assess the degree of data concentration, is the square root of the average of the
squared deviations of the judgments from their arithmetic mean

$$SD = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - x_0)^2} , \qquad (8)$$

where xᵢ is the numeric value of the i-th judgment, x₀ is the arithmetic mean of all the judgments,
and n is the number of judgments.

The sample arithmetic mean is an estimate of the mean value of the population. However, the
quality of this estimate depends on the variability of the data and the size of the sample. The
smaller the standard deviation of the sample (SD) and the larger the sample size (n), the better
the estimate. Thus, a ratio based on these two quantities, called the standard error (SE) and
defined as

$$SE = \frac{SD}{\sqrt{n}} , \qquad (9)$$

is used to assess how well a specific sample arithmetic mean estimates the mean value of the
population¹⁷.

If the data in the sample are normally distributed, the arithmetic mean x₀ and standard error SE
can be used to calculate confidence intervals for the population mean. A p confidence interval is
a range of values within which the unknown population parameter will be contained with
probability p (Yaremko et al., 1986). The most common p value in statistical analysis is p = 0.95,
which defines the 95% upper and lower limits for the parameter. Since 1.96 is the 97.5th
percentile of the standard normal distribution, the limits of the 95% confidence interval for the
mean can be calculated as


Upper 95% limit = x₀ + 1.96 SE
Lower 95% limit = x₀ − 1.96 SE.
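
The calculation of the standard error and the 95% confidence limits can be sketched in a few lines of Python. This is only an illustration: the function name and the azimuth judgments are hypothetical, and the SD uses the biased 1/n form listed in table 1.

```python
import math

def mean_confidence_interval(judgments, z=1.96):
    """Return (x0, SD, SE, lower, upper) for a list of numeric judgments.

    SD uses the biased 1/n form; z = 1.96 gives the 95% interval,
    assuming the judgments are normally distributed.
    """
    n = len(judgments)
    x0 = sum(judgments) / n
    sd = math.sqrt(sum((x - x0) ** 2 for x in judgments) / n)
    se = sd / math.sqrt(n)
    return x0, sd, se, x0 - z * se, x0 + z * se

# Hypothetical azimuth judgments (degrees), made up for illustration only.
judgments = [28.0, 31.0, 33.0, 27.0, 30.0, 35.0, 29.0, 31.0]
x0, sd, se, lower, upper = mean_confidence_interval(judgments)
print(f"x0={x0:.2f}  SD={sd:.2f}  SE={se:.2f}  95% CI=({lower:.2f}, {upper:.2f})")
```

Passing a different z value (for example, 2.576 for a 99% interval) changes the width of the interval without changing the rest of the calculation.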

The  sample  mean  and  standard  deviation  are  highly  influenced  by  outliers  (extreme  values)  in 
the  data  set.  This  is  especially  true  for  smaller  sample  sizes.  Measures  that  are  less  sensitive  to 
the presence of outliers are referred to as robust measures (Huber and Ronchetti, 2009).
Unfortunately,  many  robust  measures  are  not  very  efficient,  which  means  that  they  require  larger 
sample  sizes  for  reliable  estimates.  In  fact,  for  normally  distributed  data  (without  outliers),  the 
sample  mean  and  standard  deviation  are  the  most  efficient  estimators  of  the  underlying 
parameters. 

A very robust and relatively efficient measure of central tendency is the median (MD). The
median represents the middle point of the data, with 50% of the data lying on either side of it.
A measure of dispersion closely related to the median is the median absolute deviation
(MEAD), which is the median (middle value) of the absolute deviations from the median. One
advantage of the MD and MEAD over x₀ and SD is that the former are distribution free and do
not need any assumption about the nature of the general population (Gorard, 2005). Like
the median, the MEAD is a very robust statistical measure but is, unfortunately, also very
inefficient.

A more efficient measure of dispersion that is, however, not quite as robust is the mean absolute
deviation (MAD), which is the average of the absolute deviations from the mean. Note that the
abbreviation “MAD” is used in other publications to refer to either the median or the mean
absolute deviation (here, MEAD and MAD), which can cause confusion. The
formulas for both the standard and robust sample measures discussed previously are given in
table 1. They represent the basic measures used in calculating EE when traditional statistical
analysis is performed.



¹⁷ More formally, SE is an estimator of the true standard deviation of the distribution of the sample means of samples of size n
taken from the general population.

Table 1. Basic measures used to estimate the parameters of a normal distribution.

| Measure Name | Symbol | Definition/Formula | Comments |
|---|---|---|---|
| Arithmetic mean | x₀ | $x_0 = \frac{1}{n}\sum_{i=1}^{n} x_i$ | - |
| Sample standard deviation | SD | $SD = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - x_0)^2}$ | V (variance) = SD² |
| Median | MD | middle value of responses | - |
| Median absolute deviation | MEAD | middle value of the absolute deviations from the median | - |
| Mean absolute deviation | MAD | $MAD = \frac{1}{n}\sum_{i=1}^{n}\lvert x_i - x_0\rvert$ | - |

Note: The formula for SD listed in the table is the biased sample SD formula. To provide an unbiased
estimate of the population σ, the factor 1/n should be replaced by 1/(n−1).
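
The sensitivity of the standard measures to a single outlier, compared with their robust counterparts, can be illustrated with a short Python sketch. The function name and the data values are hypothetical; the SD is the biased 1/n form, as in table 1.

```python
import statistics

def location_and_spread(data):
    """Compute the five basic measures for a list of numeric judgments."""
    x0 = statistics.fmean(data)
    md = statistics.median(data)
    return {
        "x0": x0,
        "SD": statistics.pstdev(data),  # biased (1/n) standard deviation
        "MD": md,
        "MEAD": statistics.median([abs(x - md) for x in data]),
        "MAD": statistics.fmean([abs(x - x0) for x in data]),
    }

# Hypothetical judgments (degrees); the second set adds one extreme response.
clean = [28.0, 31.0, 33.0, 27.0, 30.0, 35.0, 29.0, 31.0]
contaminated = clean + [150.0]

for name, data in (("clean", clean), ("contaminated", contaminated)):
    measures = location_and_spread(data)
    print(name, {k: round(v, 2) for k, v in measures.items()})
```

In this made-up example the single extreme response inflates the mean and the SD substantially, while the median moves only slightly and the MEAD is unchanged.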


Strictly  speaking,  the  sample  median  estimates  the  population  median,  which  is  the  midpoint  of 
the  distribution,  i.e.,  half  the  values  (from  the  distribution)  are  below  it  and  half  are  above  it. 

The  median  together  with  the  midpoints  of  the  two  halves  of  the  distribution  on  either  side  of  the 
median  divide  the  distribution  into  four  parts  of  equal  probability.  The  three  dividing  points  are 
called the 1st, 2nd, and 3rd quartiles (Q1, Q2, and Q3), with the 2nd quartile simply being
another  name  for  the  median.  Since  the  normal  distribution  is  symmetric  around  its  mean,  its 
mean  is  also  its  median,  and  so  the  sample  median  can  be  used  to  directly  estimate  the  mean  of  a 
normal  distribution. 

The  median  absolute  deviation  of  a  distribution  does  not  coincide  with  its  standard  deviation, 
thus  the  sample  median  absolute  deviation  does  not  give  a  direct  estimate  of  the  population 
standard  deviation.  However,  in  the  case  of  a  normal  distribution,  the  median  absolute  deviation 
corresponds  to  the  difference  between  the  3rd  and  2nd  quartiles,  which  is  proportional  to  the 
standard  deviation.  Thus,  for  a  normal  distribution,  the  relationship  between  the  standard 
deviation  and  the  MEAD  is  given  by  (Goldstein  and  Taleb,  2007): 

$$\sigma \approx 1.4826\,(Q_3 - Q_2) = 1.4826\,(MEAD) . \qquad (10)$$
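
A quick numerical check of this relationship can be sketched as follows. The simulated sample, its size, and the seed are arbitrary choices made for illustration, and the helper name is hypothetical.

```python
import random
import statistics

def sigma_from_mead(data):
    """Estimate the standard deviation of normal data as 1.4826 * MEAD."""
    md = statistics.median(data)
    mead = statistics.median([abs(x - md) for x in data])
    return 1.4826 * mead

# Simulated normal sample with a true standard deviation of 10 (assumed values).
rng = random.Random(0)
sample = [rng.gauss(0.0, 10.0) for _ in range(20000)]

print("MEAD-based estimate:", round(sigma_from_mead(sample), 3))
print("sample SD:          ", round(statistics.pstdev(sample), 3))
```

For a large normal sample the two values should be close; for non-normal data the factor 1.4826 no longer applies.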


6.2  Skew  and  Kurtosis 

Skew (skewness) and kurtosis are two parameters of a data distribution that characterize its
departure from a normal, bell-like distribution. They are seldom used in auditory localization
studies but are useful in quantifying the deviation from normality of a set of localization
judgments due to poorly controlled experimental conditions that may change over time or a lack
of uniformity (or normality) in the listener panel participating in the study. They are especially
useful in assessing hard-to-quantify effects of environmental changes (e.g., wind strength and
direction) on localization data collected in the open field.

Skew  (S)  is  a  measure  of  the  lack  of  symmetry  and  was  originally  defined  by  Pearson  (Pearson, 
1894;  1895;  Stuart  and  Ord,  1994;  Wuensch,  2005)  as 


$$S = \frac{3(x_0 - MD)}{SD} , \qquad (11)$$

where x₀ is the arithmetic mean, MD is the median, and SD is the standard deviation, but was
later redefined by Fisher (1925) as

$$S = \frac{\sum_{i=1}^{n}(x_i - x_0)^3}{(n-1)\,SD^3} , \qquad (12)$$

where x₀ is the arithmetic mean, xᵢ is the i-th value in the sample, SD is the standard deviation,
and n is the sample size. To differentiate between these two concepts of skew, they are
sometimes called Pearson's skew (SP) and Fisher's skew (SF), respectively. A normal
distribution has S = 0. Skew is negative if more of the data are on the right side and the
distribution has a longer left tail. Such a distribution (or sample) is called left-skewed. Skew is
positive if more of the data are on the left side and the distribution has a longer right tail. Such a
distribution (or sample) is called right-skewed. Note that skewed data can be normalized (made
symmetrical) using, for example, the Box-Cox transformation (Box and Cox, 1964) or other
nonlinear transformations, which can be found in most statistical handbooks and standard
statistical software packages. An important use of these transformations is in outlier detection in
skewed data sets, since outliers are easier to identify in normalized distributions. For further
discussion of outliers see section 10.2.

Skew as defined in equations 11 and 12 represents the sample skew. These values can be used
as biased estimators of the underlying skew, but they do not work well for small sample sizes
(n). For small samples, an unbiased estimator of population skew is given by (Joanes and Gill,
1998)


$$S_u = \frac{\sqrt{n(n-1)}}{n-2}\,S . \qquad (13)$$


The standard error of skew (SES) can be calculated as (Cramer, 1997, p. 85; Tabachnick and
Fidell, 1996)

$$SES = \sqrt{\frac{6n(n-1)}{(n-2)(n+1)(n+3)}} \approx \sqrt{\frac{6}{n}} . \qquad (14)$$


Note that the SES is only a function of n, and it does not depend on any aspect of the shape of
the distribution (Wright and Herrington, 2011). If the absolute value of the skew is at least about
twice (more precisely, 1.96 times) as large as the SES, the distribution is considered skewed (95%
confidence interval of population skew). This means that the distribution is considered skewed if
its skew is significantly different from 0. The common approximation provided on the right side
of equation 14 (e.g., Fidell and Tabachnick, 2003, p. 117) is only valid for large n.

Kurtosis (K), from the Greek word kyrtos meaning bulging, is a measure of the “sharpness” of the
distribution and is calculated as

$$K = \frac{\sum_{i=1}^{n}(x_i - x_0)^4}{(n-1)\,SD^4} , \qquad (15)$$

where x₀ is the arithmetic mean, xᵢ is the i-th value in the sample, SD is the standard deviation,
and n is the sample size. The kurtosis of the normal distribution is 3; therefore, equation 15 is
frequently adjusted by this value to be

$$K = \frac{\sum_{i=1}^{n}(x_i - x_0)^4}{(n-1)\,SD^4} - 3 , \qquad (16)$$

so that the normal distribution has a kurtosis of 0. This normalized kurtosis is sometimes referred
to as excess kurtosis and can vary from −2 to +∞. Distributions (or samples) with negative
kurtosis (low kurtosis) are called platykurtic (flat), those without kurtosis (K = 0) are called
mesokurtic, and those with positive kurtosis (high kurtosis) are called leptokurtic (sharp).
Platykurtic (flat) distributions have a flatter top and shorter and thinner tails, while leptokurtic
(sharp) distributions have a sharp top but longer and wider tails. Skewed distributions are always
leptokurtic (Hopkins and Weeks, 1990). An unbiased estimator of underlying kurtosis that
works well for any n can be calculated as (Joanes and Gill, 1998)

$$K_u = \frac{n-1}{(n-2)(n-3)}\left[(n+1)K + 6\right] . \qquad (17)$$

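The standard error of kurtosis (SEK) can be calculated as (Cramer, 1997, p. 89)

$$SEK = \sqrt{\frac{24n(n-1)^2}{(n-3)(n-2)(n+3)(n+5)}} \approx \sqrt{\frac{24}{n}} . \qquad (18)$$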


Note  that  the  SEK,  similarly  to  the  SES,  is  only  a  function  of  n  and  does  not  depend  on  any 
aspect  of  the  shape  of  the  distribution  (Wright  and  Herrington,  2011).  If  K  differs  from  0  by  two 
or  more  SEKs,  the  distribution  cannot  be  considered  to  be  mesokurtic  (95%  confidence  interval 
of  population  kurtosis).  This  means  that  the  distribution  is  considered  platykurtic  or  leptokurtic 
if its kurtosis is different from 0 with 95% probability. The approximation given on the right
side of equation 18 is again only valid for large n.
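
The skew and kurtosis checks described in this section can be sketched in Python as shown below. This is an illustration under stated assumptions rather than the report's own procedure: the coefficients are computed from central moments normalized by n (the convention of Joanes and Gill, 1998, which differs slightly from the n − 1 normalization used in equations 12 and 16), the small-sample corrections and standard errors are the commonly published forms, and the function name and data are hypothetical.

```python
import math

def shape_statistics(data):
    """Skew, excess kurtosis, small-sample corrections, and standard errors."""
    n = len(data)
    if n < 4:
        raise ValueError("at least four values are needed")
    x0 = sum(data) / n
    # Central moments with 1/n normalization (an assumption of this sketch).
    m2 = sum((x - x0) ** 2 for x in data) / n
    m3 = sum((x - x0) ** 3 for x in data) / n
    m4 = sum((x - x0) ** 4 for x in data) / n
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    # Small-sample corrections in the form given by Joanes and Gill (1998).
    skew_u = math.sqrt(n * (n - 1)) / (n - 2) * skew
    kurt_u = (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * excess_kurtosis + 6.0)
    # Standard errors; both depend only on the sample size n.
    ses = math.sqrt(6.0 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    sek = 2.0 * ses * math.sqrt((n * n - 1.0) / ((n - 3) * (n + 5)))
    return {
        "S": skew, "K": excess_kurtosis, "Su": skew_u, "Ku": kurt_u,
        "SES": ses, "SEK": sek,
        "skewed": abs(skew_u) >= 1.96 * ses,
        "not_mesokurtic": abs(kurt_u) >= 1.96 * sek,
    }

# Hypothetical judgments (degrees) with one response far to the right.
stats = shape_statistics([28.0, 31.0, 33.0, 27.0, 30.0, 35.0, 29.0, 31.0, 52.0])
for key, value in stats.items():
    print(key, value)
```

The two boolean flags apply the rule of thumb from the text: a coefficient that differs from 0 by about two standard errors or more is treated as significant.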

6.3  Localization  Error  Metrics 

Skew and kurtosis are effective measures of how far specific sample distributions characterized
by the parameters in table 1 depart from the ideal normal distribution. The parameters listed in
table 1 are also the basis for calculating EE, CE, and RE. The main metrics used in calculating
EE and the related formulas are listed in table 2.

Table 2. Basic metrics used to calculate localization error (η denotes the true location of the sound
source).

| Metric Name | Symbol | Type | Definition/Formula | Comments |
|---|---|---|---|---|
| Mean Error (Mean Signed Error) | ME | CE | $ME = \frac{1}{n}\sum_{i=1}^{n}(x_i - \eta) = x_0 - \eta$ | - |
| Mean Absolute Error (Mean Unsigned Error) | MUE | CE and RE | $MUE = \frac{1}{n}\sum_{i=1}^{n}\lvert x_i - \eta\rvert$ | $\lvert ME\rvert \le MUE \le \lvert ME\rvert + MAD$ |
| Root-Mean-Squared Error | RMSE | CE and RE | $RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \eta)^2}$ | $RMSE^2 = ME^2 + SD^2$ |
| Sample Standard Deviation | SD | RE | $SD = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - x_0)^2}$ | - |
| Mean Absolute Deviation | MAD | RE | $MAD = \frac{1}{n}\sum_{i=1}^{n}\lvert x_i - x_0\rvert$ | MAD ≈ 1.18 MEAD |

Note: The formula for SD listed in the table is the biased sample SD formula. To provide an unbiased
estimate of the population σ, the factor 1/n should be replaced by 1/(n−1).


The  formulas  listed  in  table  2  and  discussed  previously  apply  to  normal  or  similar  unimodal 
distributions.  In  the  case  of  a  multimodal  data  distribution,  these  metrics  are  in  general  not 
applicable.  However,  if  there  are  only  a  few  modes  that  are  relatively  far  apart,  then  these 
metrics  (or  similar  statistics)  can  be  calculated  for  each  of  the  modes  using  appropriate  subsets  of 
the  data  set.  This  is  in  particular  applicable  to  the  analysis  of  reversal  errors,  which  tend  to 
define  a  separate  unimodal  distribution  (see  section  10). 

SD is the standard metric for RE, while the standard metric for CE is the ME, also called mean
bias error, which is equivalent to the difference between the sample mean of the localization data
(x₀) and the true location of the sound source. The unsigned, or absolute, counterpart to the ME,
the mean unsigned error (MUE), is a metric of total EE, as it represents a combination of both the
CE and the RE. The MUE was used, among others, by Searle et al. (1975; 1976) and Makous and
Middlebrooks (1990) in analyzing their data. Another error metric that combines the CE and RE
is the root mean squared error (RMSE). The relationship between these three metrics is given by
the following inequality, where n is the sample size (Willmott and Matsuura, 2005):

$$\lvert ME\rvert \le MUE \le RMSE \le \sqrt{n}\,MUE . \qquad (19)$$

For example, Erickson et al. (1991) reported average MUE = 6.3° and ME = 1.31° over a variety
of wide- and octave-band stimuli presented at the Air Force’s Auditory Localization Facility
(ALF).
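
The metrics of table 2 and the relationships among them can be checked numerically with a short Python sketch. The function name, the responses, and the target angle are hypothetical, and the sketch treats the angles as linear quantities, which is only reasonable when all responses lie well away from the wrap-around point of the angular scale.

```python
import math

def localization_error_metrics(responses, eta):
    """Compute ME, MUE, RMSE, SD, and MAD for responses to a source at eta."""
    n = len(responses)
    x0 = sum(responses) / n
    return {
        "ME": x0 - eta,                                            # constant error
        "MUE": sum(abs(x - eta) for x in responses) / n,           # CE and RE
        "RMSE": math.sqrt(sum((x - eta) ** 2 for x in responses) / n),
        "SD": math.sqrt(sum((x - x0) ** 2 for x in responses) / n),  # biased SD
        "MAD": sum(abs(x - x0) for x in responses) / n,
    }

# Hypothetical azimuth responses (degrees) to a source located at 30 degrees.
responses = [28.0, 31.0, 33.0, 27.0, 30.0, 35.0, 29.0, 31.0]
m = localization_error_metrics(responses, eta=30.0)
print({k: round(v, 3) for k, v in m.items()})

# Relationships listed in table 2 and in the inequality above.
print("RMSE^2 - (ME^2 + SD^2):", m["RMSE"] ** 2 - (m["ME"] ** 2 + m["SD"] ** 2))
print("|ME| <= MUE <= RMSE <= sqrt(n) MUE:",
      abs(m["ME"]) <= m["MUE"] <= m["RMSE"] <= math.sqrt(len(responses)) * m["MUE"])
```

Reporting ME and SD alongside any compound metric keeps the constant and random components of the error visible.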

The RE part of the RMSE is given by the sample standard deviation, but the RE in the MUE
does not in general correspond to any otherwise defined metric. However, if each localization
estimate is shifted by the ME so as to make the CE equal to zero, the MUE of the data
normalized in this way is reduced to the sample MAD. Since the MAD is not affected by a
constant shift of the data, the MAD of the normalized data is equal to the MAD of the non-normalized
localizations and so represents the RE of the localizations. Thus, the MAD is also a metric of
RE. For a normal distribution, the standard deviation is proportional to the mean absolute
deviation in the following ratio (Goldstein and Taleb, 2007):

$$\sigma = \sqrt{\pi/2}\,(MAD) \approx 1.253\,(MAD) . \qquad (20)$$

This means that for sufficiently large sample sizes drawn from a normal distribution, the
normalized MUE (= MAD) will be approximately equal to 0.8 times the SD. The effect of
sample size on the standard deviation of the ratio of the sample MAD and the (uncorrected)
sample SD for samples from a normal distribution is shown in figure 12. It shows that for
sample sizes larger than 50 the potential error in determining σ from MAD should not exceed
0.03, that is, 4%.


Figure  12.  The  standard  deviation  of  the  ratios  between  sample  MAD  and  sample  SD  for  sample 
sizes  10  to  100  generated  1000  times  each  plotted  against  the  size  of  the  sample. 
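
A simulation of the kind summarized in figure 12 can be sketched as follows. The report does not describe how its simulation was implemented, so the random generator, the seed, and the function names below are assumptions made for illustration.

```python
import math
import random
import statistics

def mad_sd_ratio(sample):
    """Ratio of the sample MAD to the uncorrected (1/n) sample SD."""
    x0 = statistics.fmean(sample)
    mad = statistics.fmean([abs(x - x0) for x in sample])
    return mad / statistics.pstdev(sample)

def ratio_spread(n, repeats=1000, seed=0):
    """Mean and standard deviation of MAD/SD over repeated normal samples."""
    rng = random.Random(seed)
    ratios = [mad_sd_ratio([rng.gauss(0.0, 1.0) for _ in range(n)])
              for _ in range(repeats)]
    return statistics.fmean(ratios), statistics.pstdev(ratios)

print("theoretical MAD/SD for a normal distribution:",
      round(math.sqrt(2.0 / math.pi), 4))
for n in (10, 50, 100):
    mean_ratio, sd_ratio = ratio_spread(n)
    print(f"n={n:3d}  mean ratio={mean_ratio:.4f}  SD of ratio={sd_ratio:.4f}")
```

The spread of the ratio shrinks as the sample size grows, which is the trend the figure caption describes.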

6.4  Issues  Associated  with  the  Application  of  the  Mean  and  Standard  Deviation 

The common primacy of the sample arithmetic mean and sample standard deviation for
estimating the population parameters is based on the assumption that the distribution is unimodal
and relatively symmetrical. This is frequently not the case with human experiments, which have
numerous potential sources for data contamination resulting from such factors as an unbalanced
composition of the listening panel and inconsistent concentration of the listeners. In general,
data collected in such experiments show more values farther away from the mean than expected
(heavier tails or greater kurtosis), are more likely to be multimodal, and contain more extreme
values (outliers), especially for smaller data sets.

It  is  generally  desired  that  a  small  number  of  extreme  cases  should  not  overly  affect  the 
conclusions  based  on  the  data.  Unfortunately,  this  may  not  be  the  case  with  the  sample  mean 
and  standard  deviation.  As  mentioned  earlier,  the  mean  and,  in  particular,  the  standard  deviation 
are  quite  sensitive  to  outliers  (the  inaccurate  results).  Their  more  robust  counterparts  discussed  in 
this  section  are  a  way  of  dealing  with  this  problem  without  having  to  specifically  identify  which 
results  constitute  the  outliers,  as  is  done  in  trimming  and  winsorizing  (see  section  10). 

Moreover,  the  greater  efficiency  of  the  sample  SD  over  the  MAD  disappears  with  only  a  few 
extreme  cases  in  a  large  sample  (Huber  and  Ronchetti,  2009).  Thus,  since  there  is  a  high  chance 
of  the  underlying  distribution  not  being  perfectly  normal,  the  use  of  more  robust  metrics  for 
estimating  the  CE  (mean)  and  RE  (standard  deviation)  may  be  recommended. 

It is also recommended that both components of localization error, CE and RE, always be
reported individually. A single compound metric of error such as the RMSE or MUE is not
sufficient for understanding the nature of the errors. These compound metrics can be useful for
describing total EE, but they should be treated with caution. Opinions as to whether RMSE or
MUE provides the better characterization of total EE are divided, although MUE seems to be
more commonly used (e.g., Wenzel et al., 1993; Wightman and Kistler, 1989b). The overall
GoE measure given in equation 7 clearly uses RMSE as its base. Some authors consider RMSE
as “the most meaningful single number to describe localization performance” (Hartmann, 1983a,
p. 1382) and as the type of metric that “in addition to being sensitive to information across all
locations ... is also sensitive to a wide range of changes in the target-response relationship”
(Aronoff et al., 2010, p. EL90). However, others argue that MUE is a better measure than
RMSE. Their criticism of RMSE is based on the fact that RMSE includes MUE but is
additionally affected by the square root of the sample size and the distribution of the squared
errors, which confounds its interpretation (Willmott and Matsuura, 2005).


7.  Spherical  Statistics 


Spherical statistics, also called directional statistics, is a special branch of statistics providing a
set of mathematical tools developed to analyze directions in space or positions of points on the
surface of a sphere. Such tools are used in several areas of science including astronomy,
geophysics, geography, and biological sciences. Although there were several earlier attempts to
account for the sphericity of spatially distributed data, the science of spherical statistics
began only in 1953 with the work of R. A. Fisher (1953), who mathematically described the
distribution of angular errors and provided a methodology to calculate the basic statistical
parameters (mean direction, measure of dispersion) describing such a distribution. Since
localization data are angular data, spherical statistics has to be used in the general case to
describe such data.

7.1 Limitations of Linear Statistics when Applied to Sound Localization Data

The  analysis  of  localization  data  using  linear  statistics  is  complicated  by  the  fact  that  the 
potential  locations  of  sound  sources  around  a  listener  form  a  continuous  and  circular  area.  The 
traditional  statistical  methods  discussed  earlier  were  developed  for  linear  distributions  extending 
from  negative  to  positive  infinity.  These  tools  are  not,  in  general,  appropriate  for  the  analysis  of 
circular  data,  such  as  angles,  which  wrap  around  a  circle.  The  circular  scale  can  be  considered  a 
special  case  of  an  interval  scale  with  no  natural  zero  point  and  no  natural  designation  of  high  or 
low  values  (Zar,  1999).  The  fundamental  reason  linear  statistics  is  not  appropriate  for  circular 
data  is  that  if  the  numerical  difference  between  two  angles  is  greater  than  180°,  then  their  linear 
average  will  point  in  the  opposite  direction  from  their  actual  mean  direction.  For  example,  the 
mean  direction  of  0°  and  360°  is  actually  0°,  but  the  linear  average  is  180°.  Even  if  the 
differences  between  angles  are  smaller  than  180°  but  larger  than  90°  their  arithmetic  average  has 
a  tendency  to  be  orthogonal  to  the  main  axis  of  the  distribution.  The  result  of  linear  averaging  of 
two  angles  is  additionally  dependent  on  the  way  the  angles  are  measured  (see  section  4).  For 
example,  the  average  of  the  two  angles  expressed  as  90°  and  270°  in  the  360°  notation  scheme  is 
180°, but the average of the same two angles in the ±180° (+90° and −90°) notation scheme is 0°.
Since  statistical  analysis  relies  on  being  able  to  sum  data  points,  it  is  clear  that  something  other 
than  standard  addition  must  serve  as  the  basis  for  the  statistical  analysis  of  angular  data.  The 
simple  solution  comes  from  considering  the  angles  as  vectors  of  unit  length  and  applying  vector 
addition.  This  vector  summation  of  angular  data  is  the  basis  of  spherical  statistics,  which 
provides  a  basic  set  of  tools  for  the  analysis  of  circular  data. 
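
The difference between linear averaging and vector averaging of angles can be illustrated with a brief Python sketch. The function names and the example angles are hypothetical and chosen only to show the effect of the notation scheme.

```python
import math

def linear_mean_deg(angles_deg):
    """Ordinary arithmetic mean of angle values (depends on the notation)."""
    return sum(angles_deg) / len(angles_deg)

def vector_mean_deg(angles_deg):
    """Direction of the sum of unit vectors, reported in the 0-360 degree range."""
    s = sum(math.sin(math.radians(a)) for a in angles_deg)
    c = sum(math.cos(math.radians(a)) for a in angles_deg)
    return math.degrees(math.atan2(s, c)) % 360.0

# The same two directions written in the 360-degree and +/-180-degree notations.
as_360 = [350.0, 10.0]
as_pm180 = [-10.0, 10.0]

print("linear mean:", linear_mean_deg(as_360), "vs", linear_mean_deg(as_pm180))
print("vector mean:", vector_mean_deg(as_360), "vs", vector_mean_deg(as_pm180))
```

The linear means differ by 180° between the two notations, while the vector means agree up to floating-point rounding (a value within a tiny fraction of a degree of 0° or 360° denotes the same direction).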

Spherical  statistics  is  a  set  of  analytical  methods  specifically  developed  for  the  analysis  of 
probability  distributions  on  spheres.  Distributions  on  circles  (two-dimensional  spheres)  are 
handled  by  a  subfield  of  spherical  statistics  called  circular  statistics.  Spherical  (circular)  data 
distributions  differ  from  linear  distributions  and  need  to  be  described  differently.  A  circular 
distribution  is  a  probability  distribution  whose  total  probability  is  confined  within  the 
circumference  of  a  circle  (Rao  Jammalamadaka  and  SenGupta,  2001).  An  inherent  problem  with 
considering a linear normal distribution in a circular space is that the former is
defined on an unbounded domain (−∞, +∞), while the latter is defined on a bounded domain
(−180°, +180°). Therefore, there is a real risk that non-zero probabilities of the normal
distribution will fall outside of this range and must be dropped. For example, when a linear
normal distribution with SD = 130° is wrapped around a circle, almost 20% of the data wraps on
top of itself (Cain, 1989). Only when the linear variance of the circular data is sufficiently small
(as discussed further) or when the whole data set is mostly confined to a ±90° range around a
central point can angular data be analyzed as coming from a linear distribution. Under these
conditions, the linear distribution fits almost in its entirety onto the circumference of the circle
without overlap, and the large errors can be assumed to be outliers¹⁸. In the case of spherical
angles, the data points are spread across the surface of the sphere, and linear statistics is not
appropriate for summing (averaging) horizontal angles across various elevations or vice versa.

¹⁸ Front-back errors are a special class of errors and are not considered here since they require a separate analysis.

7.2  Theoretical  Foundations 

In  the  spherical  analysis  of  auditory  localization  data,  the  data  are  vectors  that  indicate  directions 
and  have  no  meaningful  magnitudes  unless  the  judgments  involve  distance  estimation  (an 
extremely  rare  situation).  Thus,  in  the  general  case,  localization  data  can  be  represented  by 
vectors  of  unit  (1)  length  each  having  the  same  point  of  origin  and  terminating  at  the  surface  of 
the  unit  sphere  centered  at  the  point  of  origin.  Each  vector  can  be  described  by  its  declination 
(azimuth)  and  inclination  (elevation),  which  represent  projections  of  the  spherical  angle  onto  the 
horizontal  and  vertical  planes,  respectively.  If  the  auditory  localizations  are  limited  to  the 
horizontal plane, the Cartesian coordinates X and Y of the mean vector of a set of judgments (unit
vectors) corresponding to specific planar angles θ about the origin are given by

$$X = \frac{1}{n}\sum_{i=1}^{n}\sin(\theta_i) \qquad (21)$$

and

$$Y = \frac{1}{n}\sum_{i=1}^{n}\cos(\theta_i) . \qquad (22)$$


The angle θ₀ that the mean vector makes with the x-axis is the mean angular direction of all the
angles in the data set. Its calculation depends on the quadrant the mean vector is in (Rao
Jammalamadaka and SenGupta, 2001):

$$\theta_0 = \begin{cases}
\arctan(X/Y), & X \ge 0,\ Y > 0 \\
\arctan(X/Y) + \pi, & Y < 0 \\
\arctan(X/Y) + 2\pi, & X < 0,\ Y > 0 \\
\pi/2, & X > 0,\ Y = 0 \\
3\pi/2, & X < 0,\ Y = 0
\end{cases} \qquad (23)$$


This angle is frequently called the judgment centroid and is represented by the unit vector with
this angle. When X = 0 and Y = 0, the judgment centroid is undefined. The magnitude of the
mean vector is called the mean resultant length (R)¹⁹ and is calculated as

$$R = \sqrt{X^2 + Y^2} . \qquad (24)$$

R is a measure of concentration, the opposite of dispersion, which plays an important role in
defining the circular standard deviation. Its magnitude varies from 0 to 1,²⁰ with R = 1 indicating
that all the vectors (angles) in the set point in the same direction and R = 0 indicating a uniform
dispersion of the vector directions. Note that R = 0 not only for a set of angles that are evenly
distributed around the circle but also for a set of angles that are equally distributed between two
opposite directions. Thus, like the linear measures discussed in section 6, R is most meaningful
for unimodal distributions. However, as opposed to linear measures, circular measures are
independent of the way the angles are measured. The graphical representation of θ₀ and R is
shown in figure 13.

¹⁹ In some books, the mean resultant length is denoted by “r,” and R = nr, where n is the number of judgments.
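
The calculation of the judgment centroid and the mean resultant length for horizontal-plane data can be sketched in Python as shown below. The sketch follows the convention stated in the text (X as the mean sine and Y as the mean cosine of the judged angles) and uses the two-argument arctangent in place of an explicit quadrant table; the function name, the tolerance used to declare the centroid undefined, and the example angles are assumptions.

```python
import math

def circular_mean_and_R(angles_deg, tol=1e-12):
    """Return (theta0 in degrees or None, R) for a list of planar angles."""
    n = len(angles_deg)
    X = sum(math.sin(math.radians(a)) for a in angles_deg) / n
    Y = sum(math.cos(math.radians(a)) for a in angles_deg) / n
    R = math.hypot(X, Y)
    if R < tol:
        # X and Y are both (numerically) zero: the centroid is undefined.
        return None, R
    # atan2 resolves the quadrant; the result is mapped to 0-360 degrees.
    theta0 = math.degrees(math.atan2(X, Y)) % 360.0
    return theta0, R

# Hypothetical judgment sets (degrees).
print(circular_mean_and_R([20.0, 30.0, 40.0]))         # concentrated set
print(circular_mean_and_R([0.0, 180.0]))               # two opposite directions
print(circular_mean_and_R([0.0, 90.0, 180.0, 270.0]))  # evenly spread set
```

The last two calls illustrate that R is close to 0 both for evenly spread angles and for angles split equally between two opposite directions.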


Figure  13.  The  density  distribution  of  individual  judgments 
and the circular metrics θ₀ and R of the judgment
distribution.  The  radii  extending  from  the  center  of 
the  circle  to  its  circumference  represent  the 
individual  judgments  of  direction. 

In the case of spherical (3-D) data sets, the previous calculations for θ₀ and R take the form:

$$X = \frac{1}{n}\sum_{i=1}^{n}\sin(\theta_i)\sin(\varphi_i) \qquad (25)$$

$$Y = \frac{1}{n}\sum_{i=1}^{n}\cos(\theta_i)\sin(\varphi_i) \qquad (26)$$

$$Z = \frac{1}{n}\sum_{i=1}^{n}\cos(\varphi_i) \qquad (27)$$

$$\theta_0 = \arctan(X/Y) \qquad (28)$$

$$\varphi_0 = \arccos(Z) \qquad (29)$$

$$R = \sqrt{X^2 + Y^2 + Z^2} \qquad (30)$$

²⁰ In some books X and Y are calculated as sums (not averages), and in these cases, R varies from 0 to n, where n is the
number of judgments (e.g., Rao Jammalamadaka and SenGupta, 2001).
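
A minimal Python sketch of the spherical calculation is given below. Several details are assumptions rather than statements from the text: the angle φ is taken to be measured from the Z axis (as implied by Z being the mean of cos φ), the azimuth is recovered with the two-argument arctangent of X and Y to stay consistent with the planar case, and Z is divided by R before the arccosine is taken, which coincides with arccos(Z) only when R = 1. The function name and the example directions are hypothetical.

```python
import math

def spherical_mean(directions_deg, tol=1e-12):
    """Return (theta0, phi0, R) for a list of (theta, phi) pairs in degrees."""
    n = len(directions_deg)
    X = Y = Z = 0.0
    for theta, phi in directions_deg:
        t, p = math.radians(theta), math.radians(phi)
        X += math.sin(t) * math.sin(p)
        Y += math.cos(t) * math.sin(p)
        Z += math.cos(p)
    X, Y, Z = X / n, Y / n, Z / n
    R = math.sqrt(X * X + Y * Y + Z * Z)
    if R < tol:
        return None, None, R  # mean direction undefined
    theta0 = math.degrees(math.atan2(X, Y)) % 360.0
    # Normalizing by R keeps the arccosine argument within [-1, 1].
    phi0 = math.degrees(math.acos(max(-1.0, min(1.0, Z / R))))
    return theta0, phi0, R

# Hypothetical judgments: (azimuth, angle from the Z axis), in degrees.
print(spherical_mean([(40.0, 90.0), (40.0, 90.0)]))
print(spherical_mean([(35.0, 80.0), (45.0, 100.0), (40.0, 90.0)]))
```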


7.3  Circular  Distributions 

There  are  a  number  of  theoretical  statistical  distributions  designed  to  represent  circular  data. 
These  include  the  uniform  distribution,  the  wrapped  normal  distribution  (Fisher  distribution),  and 
the  von  Mises  distribution.  In  the  case  of  spherical  data,  the  additional  distributions  include  the 
von  Mises-Fisher  distribution  (which  reduces  to  the  von  Mises  distribution  on  a  circle)  and  the 
bivariate  (elliptical)  circular  normal  distribution  (Kent  distribution).  One  of  these  distributions 
should  be  referred  to  when  a  theoretical  distribution  is  needed  to  characterized  circular  or 
spherical  data  (Cain,  1989).  The  differences  between  the  above  distributions  (save  uniform 
distribution)  are  rather  small,  and  the  user  can  select  whichever  distribution  is  more  convenient 
(Batschelet,  1981;  Fisher,  1987;  Mardia,  1982).  The  wrapped  normal  distribution  is  additive,  has 
simple  trigonometric  moments,  and  leads  to  tractable  measures  of  variance,  skew,  and  kurtosis. 
The  von  Mises  distribution  is  easier  to  use  in  hypothesis  testing,  maximizes  entropy,  and 
possesses the maximum likelihood property (Cabot, 1977). In addition, depending on the value
of the $\kappa$ parameter, the von Mises distribution (equation 31) may approximate the wrapped
normal distribution (large $\kappa$) or reduce to the uniform distribution ($\kappa = 0$). The Kent distribution
is  preferable  in  spherical  data  analysis  when  the  error  distribution  on  the  sphere  has  very 
different  patterns  in  the  horizontal  and  vertical  directions  (Leong  and  Carlile,  1998). 

Since  the  wrapped  normal  distribution  and  the  von  Mises  distribution  “may  be  made  to 
approximate  each  other  very  closely,  it  is  usually  assumed  that  the  two  distributions 
approximately  share  each  other’s  properties”  (Cabot,  1977,  p.  5).  Therefore,  the  selection  of  one 
or  the  other  depends  on  the  type  of  data  and  the  research  question.  In  the  case  of  auditory 
localization  data  that  can  vary  from  uniform  to  normal  distributions  depending  on  the 
experimental  conditions,  the  circular  data  distribution  is  typically  characterized  by  the  von  Mises 
distribution (Fisher, 1996; Fisher et al., 1987):


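$$f(\theta) = \frac{1}{2\pi I_0(\kappa)}\, e^{\kappa\cos(\theta-\theta_0)} \tag{31}$$

where $\theta$ is the angle, $\theta_0$ the mean angle, and $I_0(\kappa)$ the modified Bessel function of order 0:

$$I_0(\kappa) = \frac{1}{2\pi}\int_0^{2\pi} e^{\kappa\cos(\theta)}\,d\theta \tag{32}$$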
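
As an illustration only, the von Mises density can be evaluated numerically as sketched below. The function names `bessel_i0` and `von_mises_pdf` are hypothetical, and the Bessel function is computed from its standard power series rather than from the integral form; angles are assumed to be in radians.

```python
import math

def bessel_i0(kappa, terms=60):
    """Modified Bessel function of order 0 from its power series:
    I0(k) = sum over m >= 0 of (k/2)^(2m) / (m!)^2.
    Sixty terms are an assumed cutoff, adequate for moderate kappa."""
    total, term = 0.0, 1.0
    for m in range(terms):
        total += term
        # ratio of consecutive series terms: (k/2)^2 / (m+1)^2
        term *= (kappa / 2.0) ** 2 / ((m + 1) ** 2)
    return total

def von_mises_pdf(theta, theta0, kappa):
    """von Mises density for angle theta, mean angle theta0 (radians),
    and concentration kappa."""
    return math.exp(kappa * math.cos(theta - theta0)) / (
        2.0 * math.pi * bessel_i0(kappa))
```

With `kappa = 0` the function returns the uniform density 1/(2π) for every angle, and for any `kappa` the density integrates to 1 over the full circle.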


The $\kappa$ parameter in the von Mises distribution, as well as in other circular distributions, is not a
measure of dispersion, like the standard deviation, but, like $R$, is a measure of concentration. A
biased estimator of $\kappa$, which is good for large samples, is given by (McFadden, 1980; Tauxe,
2010)

$$\hat{\kappa} = \frac{n-1}{n-R} \tag{33}$$

where $n$ is the sample size, and $R$ is the length of the resultant vector ($0 < R < 1$). An unbiased
estimate of $\kappa$ for small samples ($n < 16$) is given by (Fisher et al., 1987; Wightman and Kistler,
1989b)

$$\hat{\kappa} = \frac{(n-1)}{n}\,\frac{(n-1)}{(n-R)} = \frac{(n-1)^2}{n(n-R)} \tag{34}$$


Frequently, instead of $\kappa$, its reciprocal $\kappa^{-1}$ is reported in the scientific literature, as it has an
interpretation similar to that of the variance (Wenzel et al., 1993). With $\kappa = 0$, the von Mises
distribution is equal to the uniform distribution on the circle, and as $\kappa$ increases the distribution
becomes more and more concentrated around its mean. As $\kappa$ continues to increase, the von
Mises distribution resembles a wrapped normal distribution more and more closely:

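$$f(\theta) = \frac{1}{\sigma\sqrt{2\pi}} \sum_{k=-\infty}^{\infty} \exp\!\left(-\frac{(\theta-\theta_0+2\pi k)^2}{2\sigma^2}\right) \tag{35}$$

where $\theta_0$ and $\sigma$ are the mean and standard deviation of the linear distribution.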
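
A minimal sketch of the wrapped normal density follows, assuming angles in radians and truncating the infinite sum to a finite number of wraps. The function name `wrapped_normal_pdf` and the default of 10 wraps on each side are illustrative choices, not part of the original text.

```python
import math

def wrapped_normal_pdf(theta, theta0, sigma, wraps=10):
    """Wrapped normal density with the sum over k truncated to
    k = -wraps .. +wraps (an assumed cutoff; the neglected terms are
    negligible unless sigma is very large)."""
    total = 0.0
    for k in range(-wraps, wraps + 1):
        d = theta - theta0 + 2.0 * math.pi * k
        total += math.exp(-d * d / (2.0 * sigma * sigma))
    return total / (sigma * math.sqrt(2.0 * math.pi))
```

Integrating this density numerically over one full circle gives 1, and the mean of cos(θ − θ0) under it equals exp(−σ²/2), the standard relation between the mean resultant length and σ for the wrapped normal distribution.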


One  of  the  most  significant  differences  between  spherical  statistics  and  linear  statistics  is  that  due 
to  the  bounded  range  over  which  the  distribution  is  defined,  there  is  no  generally  valid 
counterpart  to  the  linear  standard  deviation  in  the  sense  that  intervals  defined  in  terms  of 
multiples  of  the  standard  deviation  represent  a  constant  probability  independent  of  the  value  of 
the  standard  deviation  (Fisher,  1987).  Clearly,  as  the  circular  standard  deviation  increases,  fewer 
and  fewer  standard  deviations  are  needed  to  cover  the  whole  circle. 


7.4  Circular  Standard  Deviation 

A  reasonable  approach  to  defining  the  circular  standard  deviation  would  be  to  base  it  on  the 
wrapped  normal  distribution  so  that  for  a  wrapped  normal  distribution  it  would  coincide  with  the 
standard  deviation  of  the  underlying  linear  distribution.  This  can  be  accomplished  due  to  the  fact 
that  for  the  wrapped  normal  distribution,  there  is  a  direct  relationship  between  the  mean  resultant 
length, $R$, and the underlying linear standard deviation, $\sigma$ (in radians) (Cabot, 1977):

$$R = e^{-\sigma^2/2} \tag{36}$$

The above equality provides the general definition of the circular standard deviation as (Mardia,
1972)21:

$$\sigma_c = \sigma = \sqrt{-2\ln(R)} \tag{37}$$

21 To convert the expressions 35-37 from radians to degrees, multiply the result by 180°/π.

If $R > 0.82$, equation 37 can be approximated with less than 5% error by (Fisher, 1987)

$$\sigma_c \approx \sqrt{2(1-R)} \tag{38}$$


Circular  variance  is  defined  as  in  the  linear  case  as  the  square  of  the  standard  deviation. 
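
The relations in this section can be sketched in a few lines. The function names are hypothetical, angles are assumed to be supplied in degrees, and R is assumed to lie in (0, 1]. The approximate form used below, sqrt(2(1 − R)), is an assumption: it follows from ln(R) ≈ −(1 − R) near R = 1 and stays within 5% of equation 37 for R > 0.82.

```python
import math

def mean_resultant_length(angles_deg):
    """Mean resultant length R (0..1) of a set of angles in degrees."""
    n = len(angles_deg)
    s = sum(math.sin(math.radians(a)) for a in angles_deg) / n
    c = sum(math.cos(math.radians(a)) for a in angles_deg) / n
    return math.hypot(s, c)

def circular_sd_deg(R):
    """Circular standard deviation sqrt(-2 ln R), converted to degrees."""
    return math.degrees(math.sqrt(-2.0 * math.log(R)))

def circular_sd_approx_deg(R):
    """Assumed small-dispersion approximation sqrt(2 (1 - R)), in degrees."""
    return math.degrees(math.sqrt(2.0 * (1.0 - R)))

def circular_variance_deg2(R):
    """Circular variance as the square of the circular standard deviation."""
    return circular_sd_deg(R) ** 2
```

For example, R = 0.82 gives a relative difference between the exact and approximate values of a little under 5%, while R = 0.80 already exceeds 5%.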


The  sample  circular  mean  direction  and  sample  circular  standard  deviation  can  be  used  to 
describe  any  circular  data  set  drawn  from  a  normal  circular  distribution.  However,  if  the  angular 
data are within ±90°, or within any other numerically continuous 180° range, then linear
measures  can  still  be  used.  Since  standard  addition  applies,  the  linear  mean  can  be  calculated, 
and  it  will  be  equal  to  the  circular  mean  angle.  The  linear  standard  deviation  will  also  be  almost 
identical  to  the  circular  standard  deviation  as  long  as  the  results  are  not  overly  dispersed.  This 
can  be  seen  in  figure  14,  in  which  the  circular  standard  deviation  is  compared  to  the  linear 
standard  deviation  for  a  set  of  500  samples  of  size  10  and  100. 


[Figure 14 panels: (a) Linear Standard Deviation vs. Circular Standard Deviation, sample size 10 (500 samples); (b) Linear Standard Deviation vs. Circular Standard Deviation, sample size 100 (500 samples). Axis label: Linear SD.]

Figure  14.  Comparison  of  circular  and  linear  standard  deviations  for  500  samples  of  (a)  small  (n  =  10) 
and  (b)  large  (n  =  100)  size. 

The  samples  were  drawn  from  linear  normal  distributions  with  standard  deviations  randomly 
selected in the range 1° < $\sigma$ < 60°. The two sample standard deviations begin to deviate slightly
at about $\sigma$ = 30°, but even at $\sigma$ = 60° the difference is not too great for the larger sample size. In
fact,  the  relationship  between  the  linear  standard  deviation  and  the  circular  standard  deviation  is 
not  so  much  a  function  of  the  range  of  data  as  of  its  dispersion.  So,  for  angular  data  that  are 
assumed  to  come  from  a  reasonably  concentrated  normal  distribution,  as  would  be  expected  in 
most  localization  studies,  the  linear  standard  deviation  can  be  used  even  if  the  data  span  the  full 
360°,  as  long  as  the  mean  is  calculated  as  the  circular  mean  angle.  It  remains  strongly  advised 
that,  as  mentioned  earlier,  localization  errors  greater  than  120°  (reversal  errors)  should  not  be 
excluded  from  the  data  set  but  should  be  analyzed  separately. 

7.5 Other Circular Statistics


Once  the  circular  mean  has  been  calculated,  the  formulas  in  table  2  in  section  6  can  be  used  to 
calculate  the  circular  counterparts  to  the  other  linear  error  metrics.  For  example,  the  spherical 
(circular)  95%  confidence  angle,  which  is  an  analog  of  the  confidence  interval  in  linear  statistics, 
can be calculated (Fisher et al., 1987, p. 131). The determination of the circular median, and thus
of  the  MEAD,  is  typically  a  much  more  involved  process.  The  problem  is  that  there  is,  in 
general,  no  natural  point  on  the  circle  from  which  to  start  ordering  the  data  set.  However,  a 
defining  property  of  the  median  is  that  for  any  data  set,  the  average  absolute  deviation  is 
minimized  when  calculated  with  respect  to  the  median,  with  deviation  being  the  length  of  the 
shorter  arc  between  each  data  point  and  the  reference  point.  Note  that  a  circular  median  does  not 
necessarily  always  exist,  as  for  example,  for  a  data  set  that  is  uniformly  distributed  around  the 
circle  (Mardia,  1972).  If  however,  the  range  of  the  data  set  is  less  than  360°  and  has  two  clear 
endpoints,  then  the  calculation  of  the  median  and  MEAD  can  be  done  as  in  the  linear  case. 
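
The defining property above suggests a simple brute-force sketch: evaluate the mean shorter-arc deviation at each observed angle and keep the minimizer. The function names are hypothetical, and restricting the candidates to the observed angles is an assumption of this sketch (it relies on the summed arc distance being piecewise linear, with its minima at data points). For a uniformly spread data set, every candidate yields the same deviation and the returned value is not meaningful.

```python
def arc_distance_deg(a, b):
    """Length of the shorter arc between two directions, in degrees (0..180)."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def circular_median_deg(angles_deg):
    """Return (median, mean absolute arc deviation about that median).
    Candidates are limited to the observed angles (assumption of this sketch);
    ties are resolved in favor of the first candidate encountered."""
    best = None
    n = len(angles_deg)
    for cand in angles_deg:
        dev = sum(arc_distance_deg(cand, a) for a in angles_deg) / n
        if best is None or dev < best[1]:
            best = (cand % 360.0, dev)
    return best
```

Because deviations are measured along the shorter arc, data clustered around 0° (for example 350°, 355°, 0°, 5°, 20°) are handled without choosing a starting point for ordering.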

Circular measures of skew and kurtosis as well as circular regression equations can also be
calculated  (e.g.,  Cabot  1977;  Mardia,  1972;  Rao  Jammalamadaka  and  SenGupta,  2001). 
However,  for  data  with  low  variability,  they  provide  results  very  similar  to  linear  measures,  and 
the  linear  measures  can  be  alternatively  used  under  the  same  conditions  as  mentioned  above  for 
the  applicability  of  linear  measures  to  circular  data  (Batschelet,  1981;  Mardia,  1972;  Zar,  1999). 


In  some  cases,  there  are  two  (or  more)  angular  variables  that  may  be  related  and  their  degree  of 
association needs to be determined. For example, in an analysis of the localization judgments
performed  in  the  open  air,  a  degree  of  association  between  the  perceived  sound  direction  and  the 
degree  of  sound  source  visibility  may  be  of  interest.  The  degree  of  association  between  two 
circular  variables  can  be  measured  using  circular  covariance  (Cabot,  1977)  or  the  circular 
correlation  coefficient,  r(x,  y),  defined  as  (Rao  Jammalamadaka  and  Sarma,  1988;  Rao 
Jammalamadaka  and  SenGupta,  2001) 


$$r(x, y) = \frac{\sum_{i=1}^{n} \sin(x_i - x_0)\,\sin(y_i - y_0)}{\sqrt{\sum_{i=1}^{n} \sin^2(x_i - x_0)\,\sum_{j=1}^{n} \sin^2(y_j - y_0)}} \tag{39}$$


where $n$ is the number of data points, $i$ and $j$ are specific data points, $x$ is the first angular
variable, $y$ is the second angular variable, and $x_0$ and $y_0$ are their respective mean values. The
value of $r(x, y)$ varies from -1 to 1, where zero indicates that there is no relationship between the
variables, and +1 and -1 represent identity and reversal between both variables, respectively.
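
A minimal sketch of the circular correlation coefficient is given below, assuming paired angles in degrees and taking the mean values as circular mean directions computed with the usual arctangent of summed sines and cosines. The function names are hypothetical.

```python
import math

def circular_mean_rad(angles_rad):
    """Circular mean direction (radians) from summed sines and cosines."""
    s = sum(math.sin(a) for a in angles_rad)
    c = sum(math.cos(a) for a in angles_rad)
    return math.atan2(s, c)

def circular_correlation(x_deg, y_deg):
    """Circular correlation of paired angular samples given in degrees:
    sum of products of sine deviations, normalized by the root of the
    product of the summed squared sine deviations."""
    x = [math.radians(v) for v in x_deg]
    y = [math.radians(v) for v in y_deg]
    x0, y0 = circular_mean_rad(x), circular_mean_rad(y)
    sx = [math.sin(v - x0) for v in x]
    sy = [math.sin(v - y0) for v in y]
    num = sum(a * b for a, b in zip(sx, sy))
    den = math.sqrt(sum(a * a for a in sx) * sum(b * b for b in sy))
    return num / den
```

A constant angular offset between the two variables leaves the coefficient at +1, whereas mirroring one variable (y = −x) yields −1.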

7.6  Circular  Data  Hypothesis  Testing 

Both  parametric  and  non-parametric  statistical  tests  can  be  used  to  test  hypotheses  related  to 
circular  data.  They  only  require  that  the  angular  measurements  (judgments)  are  independent 
events  (Batschelet,  1981).  The  two  basic  statistical  tests  that  are  used  to  test  for  uniformity  of 
the distribution of circular data are the nonparametric Rayleigh z test and the Rao $U_n$ test. The
Rayleigh z test is used to determine whether the data distribution around a circle is sufficiently
random to assume a uniform spread of judgments. The null ($H_0$) and alternate ($H_1$) hypotheses
of the Rayleigh test are formulated as

• $H_0$: the data distribution has no mean direction

• $H_1$: the data distribution has a mean direction.

If hypothesis $H_0$ is rejected, this means that the data set has a calculable mean value regardless of
the underlying distribution. The Rayleigh test examines the length of the mean vector $R$ in
relation to the size, $n$, of the data set. The test statistic $z$ is formed as

$$z = nR^2 \tag{40}$$

In  the  case  of  localization  data,  n  is  the  number  of  judgments  and  R  is  the  measure  of  the  angular 
spread  of  judgments  (mean  resultant  length).  Critical  values  of  the  Rayleigh  z  value  can  be 
found  in  some  statistics  books,  e.g.,  Zar  (1999,  table  B.34). 
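
The Rayleigh statistic can be computed as sketched below. The function name is hypothetical, angles are assumed to be in degrees, and the returned p-value uses only the first-order large-sample approximation p ≈ exp(−z); for small samples, tabulated critical values should be preferred.

```python
import math

def rayleigh_z(angles_deg):
    """Return (R, z, p_approx) for the Rayleigh test of uniformity.
    R is the mean resultant length and z = n * R**2.
    p_approx = exp(-z) is a large-sample approximation only."""
    n = len(angles_deg)
    s = sum(math.sin(math.radians(a)) for a in angles_deg) / n
    c = sum(math.cos(math.radians(a)) for a in angles_deg) / n
    R = math.hypot(s, c)
    z = n * R * R
    return R, z, math.exp(-z)
```

Ten identical judgments give R = 1 and z = 10, whereas four judgments at 0°, 90°, 180°, and 270° give z close to zero.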

Note  that  the  Rayleigh  z  test  fails  when  the  distribution  is  multimodal  (e.g.,  bimodal).  Such  a 
distribution  may  be  falsely  determined  to  be  uniform  although  all  the  data  may  be  concentrated 
at  just  two  or  three  locations.  This  may  be  the  case  when  there  is  a  large  percentage  of  reversal 
errors  in  the  data  set.  Jones  and  James  (1969)  and  Mardia  (1972)  discuss  bimodal  distributions 
and  some  numeric  methods  that  can  be  used  in  describing  such  distributions.  However,  in  the 
case  of  localization  judgments,  such  descriptions  need  to  be  supplemented  by  separate  analyses 
of  both  parts  of  the  overall  distribution. 

When the angles in a bimodal distribution are highly concentrated at two opposite azimuths,
the distribution is called a diametrically bimodal circular distribution. One
convenient method of calculating the mean angle of such a bimodal distribution is angle doubling,
which has the effect of folding the data. In this method, each angle is doubled, and if all doubled
angles are smaller than 360°, then the vector-based procedure described above for calculating the mean
angle can be used. If a doubled angle is larger than 360°, then 360° is subtracted from this
angle before it is included in the averaging procedure (e.g., Marr, 2011).
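
The angle-doubling procedure can be sketched as follows. The function name is hypothetical, angles are assumed to be in degrees, and the final halving of the mean of the doubled angles (to return to the original angular scale) is the usual last step and is assumed here rather than taken from the text.

```python
import math

def axial_mean_deg(angles_deg):
    """Mean axis of diametrically bimodal data by angle doubling.
    Each angle is doubled and reduced modulo 360 degrees, the vector mean
    of the doubled angles is computed, and the result is halved.
    The second mode lies 180 degrees away from the returned angle."""
    doubled = [(2.0 * a) % 360.0 for a in angles_deg]
    s = sum(math.sin(math.radians(d)) for d in doubled)
    c = sum(math.cos(math.radians(d)) for d in doubled)
    mean_doubled = math.degrees(math.atan2(s, c)) % 360.0
    return mean_doubled / 2.0
```

For judgments at 10°, 20°, 190°, and 200°, the doubled angles fold onto 20° and 40°, so the sketch returns an axis of 15° (with the opposite mode at 195°).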

The unimodal limitation of the Rayleigh z test does not apply to the Rao $U_n$ test of uniformity
(Rao Jammalamadaka and SenGupta, 1972), which tests the hypothesis of a uniform distribution
($H_0$) against the hypothesis of single or multimodal distribution ($H_1$). In the Rao test, all the
observations $\theta_i$ are arranged in increasing order, and the angular distances between successive
observations are calculated and compared against the average angular distance $\theta_0$ = 360°/n. The
sum of absolute deviations from $\theta_0$ is used as the test statistic

$$U_n = \frac{1}{2}\sum_{i=1}^{n} |\theta_i - \theta_0| \tag{41}$$

Small values of $U_n$ indicate a uniformly distributed data set. Critical values of the Rao $U_n$
statistic can be found, for example, in Rao Jammalamadaka and SenGupta (1972).
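
A minimal sketch of the Rao statistic follows, assuming angles in degrees. The deviations are taken on the angular distances (spacings) between successive ordered observations, including the wrap-around spacing between the last and the first observation; the function name is hypothetical.

```python
def rao_spacing_u(angles_deg):
    """Rao spacing statistic: half the sum of absolute deviations of the
    successive angular spacings from the expected spacing 360/n."""
    n = len(angles_deg)
    a = sorted(x % 360.0 for x in angles_deg)
    spacings = [a[i + 1] - a[i] for i in range(n - 1)]
    spacings.append(360.0 - a[-1] + a[0])  # wrap-around spacing
    expected = 360.0 / n
    return 0.5 * sum(abs(t - expected) for t in spacings)
```

Four evenly spaced judgments (0°, 90°, 180°, 270°) give a statistic of 0, while four identical judgments give 270°, the largest value possible for n = 4.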

In some practical cases, it is useful to test whether the sample data are oriented in a specific
predetermined direction. For example, the investigator may have reasons to expect, in advance, that
the data set will be oriented toward a specific direction and may want to test this prediction (Zar,
1999). This hypothesis can be tested using the V test (Greenwood and Durand, 1955; Durand
and Greenwood, 1958) with $H_0$ and $H_1$ hypotheses:

• $H_0$: the population data are randomly distributed in reference to the predicted direction

• $H_1$: the population data are not randomly distributed in reference to the predicted direction

The test statistic $V$ is a modified Rayleigh $z$ statistic and is computed as

$$V = nR\cos(\theta_0 - \theta_{\mathrm{pred}}) \tag{42}$$


and the critical value $u$ is calculated as

$$u = V\sqrt{\frac{2}{n}} \tag{43}$$


The critical values of $u(\alpha, n)$ are available, for example, in Zar (1999, table B-35).
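
The V test computation can be sketched as below. The function name is hypothetical, angles are assumed to be in degrees, and the form u = V·sqrt(2/n) is an assumption of this sketch about how u is obtained from V; the resulting u is meant to be compared against tabulated critical values.

```python
import math

def v_test(angles_deg, predicted_deg):
    """Return (V, u) for the V test against a predicted direction.
    V = n * R * cos(theta0 - theta_pred), where R is the mean resultant
    length and theta0 the sample mean direction.
    u = V * sqrt(2 / n) is the assumed form of the tabulated statistic."""
    n = len(angles_deg)
    s = sum(math.sin(math.radians(a)) for a in angles_deg) / n
    c = sum(math.cos(math.radians(a)) for a in angles_deg) / n
    R = math.hypot(s, c)
    theta0 = math.atan2(s, c)
    V = n * R * math.cos(theta0 - math.radians(predicted_deg))
    u = V * math.sqrt(2.0 / n)
    return V, u
```

Eight judgments all at 30° with a predicted direction of 30° give V = 8 and u = 4; with a predicted direction of 210°, V becomes −8, indicating orientation opposite to the prediction.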

The  Rayleigh,  Rao,  and  V  tests  can  be  considered  to  be  tests  of  the  significance  of  the  mean  and 
do  not  require  any  assumptions,  except  for  unimodality  in  the  case  of  the  Rayleigh  test,  about  the 
underlying  distribution.  They  can  be  used  to  test  for  the  lack  of  a  single  modality,  the  lack  of 
any  modality,  or  the  lack  of  a  specific  modality  in  the  data  set,  respectively. 


To test whether a given theoretical distribution is supported by evidence from the data set, the
nonparametric Kuiper test (modified Kolmogorov-Smirnov test) or Watson one-sample $U^2$ test
can be used (Kuiper, 1962; Mardia, 1972; Zar, 1999). The Watson two-sample $U^2$ test can be
used to compare two data distributions. The Watson two-sample statistic is calculated as
(Watson, 1962):

$$U^2 = \frac{n_1 n_2}{N^2}\left[\sum_{k=1}^{N} d_k^2 - \frac{1}{N}\left(\sum_{k=1}^{N} d_k\right)^2\right] \tag{44}$$

where $N = n_1 + n_2$, $n_1$ and $n_2$ are the two sample sizes, $d_k = i/n_1 - j/n_2$, and $i$ and $j$ are the
respective ranks of the specific angular values within each sample. Critical values of the Watson
$U^2$ test and many other statistical tests, both parametric and nonparametric, that are applicable to
circular  data  can  be  found  in  many  advanced  statistics  books  (e.g.,  Batschelet,  1981;  Mardia, 
1972; Zar, 1999; Rao Jammalamadaka and SenGupta, 2001). For example, the nonparametric
homogeneity test known as the Wheeler-Watson-Mardia (WWM) test (Batschelet, 1981; Jin et
al., 2004) can be used to measure the similarity of two different horizontal and vertical angular
data distributions. The special-purpose Oriana (http://www.kovcomp.co.uk) statistical package
provides direct support for circular statistics. Other statistical software that supports circular and
spherical statistical analysis includes SAS macros (e.g., Kolliker, 2005), a MATLAB
Toolbox for Circular Statistics (Berens, 2009), and CircStat for S-Plus, R, and Stata (e.g., Rao
Jammalamadaka and SenGupta, 2001).

Finally,  regardless  of  whether  the  localization  data  are  analyzed  using  circular  (spherical)  or 
alternative  linear  statistics,  it  is  not  sufficient  to  report  only  the  significance  level  (p-value)  of  the 
measured  effect.  The  p-value  only  tells  whether  the  sample  size  n  is  large  enough  to  state  that  a 
given  value  is  statistically  different  from  a  certain  criterion,  but  it  does  not  tell  by  how  much. 
Almost  any  trivial  difference  can  be  statistically  significant  if  the  sample  size  is  large  enough. 
Similarly, the size of the observed effect may be quite large, but due to a small sample size, the
effect may not be statistically significant. Therefore, in all cases, whether or not the test results
are statistically significant, it is important to calculate and report a standardized measure of the
effect size (Hedges, 2007). Such measures include measures of association (e.g., Pearson's r,
coefficient of regression), measures of difference between groups (e.g., Cohen's d, Hedges' g),
and  odds  ratios.  Most  of  these  measures  have  linear,  circular,  and  linear-circular  variants  that 
can  be  used  depending  on  the  type  of  the  data  set  (Batschelet,  1981,  pp.  184-196). 


8.  Localization  Discrimination 


Localization discrimination is a relative judgment of the spatial location of one object in
reference to another. The basic metric of the relative localization ability of the listener is the
minimum audible angle (MAA). The MAA is the minimum detectable difference in azimuth (or
elevation) between locations of two identical but not simultaneous sound sources22 (Mills, 1958;
1972;  Perrott,  1969).  In  other  words,  the  MAA  is  the  smallest  perceptible  difference  in  the 
position  of  a  sound  source.  It  indicates  the  “resolution”  of  the  auditory  localization  system.  The 
MAAs  for  pure  tones  measured  at  various  directions  of  incoming  sound  are  shown  in  figure  15. 

22 In some MAA studies the second stimulus starts before the end of the first one (e.g., Perrott and Pacheco, 1989).

Figure 15. MAAs for pure tones measured at various directions of incoming sound.

Adapted from Mills (1958).

To  measure  the  MAA,  the  listener  is  presented  with  two  successive  sounds  coming  from  two 
different  although  nearby  locations  in  space  and  is  asked  to  determine  whether  the  second  sound 
came  from  the  left  or  the  right  of  the  first  one.  Since  both  locations  are  in  close  proximity,  the 
CE  of  the  two  sound  source  positions  is  constant  and  can  be  subtracted  out.  Thus  MAA  does  not 
include  any  CE  and  is  only  a  measure  of  RE  (e.g.,  Hartmann,  1983a). 

The  MAA  is  calculated  as  half  the  angle  between  the  minimal  positions  to  left  and  right  of  the 
sound  source  that  result  in  a  75%  correct  response  rate.  It  depends  on  both  the  frequency  and  the 
direction of arrival of the sound wave. For wideband stimuli and low frequency tones, the MAA
is on the order of 1° to 2° for the frontal position (Mills, 1958; 1972; Perrott and Saberi, 1990),
increases to 8-10° at 90° (Kuhn, 1987), and decreases again to 6-7° at the rear (Blauert,
1974/2001; Mills, 1958; Perrott, 1969). For low frequency tones arriving from the frontal
position, the MAA corresponds well with the difference limen (DL) for ITD (~10-20 µs) (Yost
and Hafter, 1987), and for high frequency tones, it matches well with the difference limen for
IID (0.5-1.0 dB), both measured by earphone experiments. The MAA is largest for mid-high
frequencies, especially for angles exceeding 40° (Mills, 1958; 1960; 1972). The size of the
MAA also depends on the duration of the interstimulus interval (ISI) between the onset of the
first and second stimulus. As the ISI increases, the MAA initially decreases and becomes ISI
independent for durations exceeding 100-150 ms (Perrott and Pacheco, 1989; Strybel et al.,
2000). An ISI duration of 100-150 ms may be interpreted as the minimum time needed for the
resolution of two spatially different sound sources (Perrott and Pacheco, 1989). This time
agrees  quite  well  with  the  150-200  ms  minimum  switching  time  reported  by  Blauert  (1972)  for 
the  resolution  of  a  “ping-pong”  effect  presented  to  the  listener  through  earphones. 
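
As a rough illustration of the MAA calculation, the sketch below finds the 75% correct point on each side by linear interpolation between measured points and then takes half the angle between the two positions. The function names, the use of linear interpolation (rather than a fitted psychometric function), and any data passed to it are assumptions of this sketch.

```python
def threshold_75(separations_deg, proportion_correct, criterion=0.75):
    """Smallest separation at which the criterion is reached, found by
    linear interpolation between adjacent measured points.
    Returns None if the criterion is never crossed."""
    pairs = sorted(zip(separations_deg, proportion_correct))
    for (x0, p0), (x1, p1) in zip(pairs, pairs[1:]):
        if p0 < criterion <= p1:
            return x0 + (criterion - p0) * (x1 - x0) / (p1 - p0)
    return None

def maa_deg(left_threshold_deg, right_threshold_deg):
    """Half the angle between the left and right 75% correct positions."""
    return 0.5 * (left_threshold_deg + right_threshold_deg)
```

With hypothetical proportions of 0.55, 0.65, 0.85, and 0.95 at separations of 0.5°, 1°, 2°, and 4°, the interpolated 75% point falls at 1.5°; if the opposite side gave 2.5°, the MAA would be 2°.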

The vertical MAA is about 3°-9° at the frontal position for sound sources in the median plane (e.g.,
Blauert,  1974/2001;  Perrott  and  Saberi,  1990).  However,  Grantham  et  al.  (2003)  reported  that 
only  6  of  their  20  listeners  produced  a  vertical  MAA  of  less  than  10°  for  wideband  noise  signals 
recorded  through  Knowles  Electronic  Manikin  for  Acoustic  Research  (KEMAR)  ears  and  played 
to  the  listeners  through  insert  earphones.  Perrott  and  Saberi  (1990)  and  Saberi  and  Perrott  (1990) 
also  measured  MAAs  for  sound  sources  aligned  in  several  diagonal  planes  in  front  of  the  listener 
and  reported  that  they  remained  similar  (within  1°)  to  those  measured  at  the  0°  plane  until  the 
angle  of  the  plane  increased  above  80°.  The  authors  concluded  that  the  MAA  for  frontal 
positions  is  practically  independent  of  the  plane  of  presentation  until  the  plane  becomes  nearly 
vertical.  This  observation  does  not  hold  for  the  rear  hemisphere,  where  MAAs  at  the  60°  plane 
were  almost  twice  as  large  as  those  measured  in  the  frontal  hemisphere  (Saberi  et  al.,  1991a).  In 
contrast,  Grantham  et  al.  (2003)  reported  slightly  (but  significantly)  larger  MAA  values  (about 
3°)  in  the  diagonal  direction  (60°)  than  in  the  horizontal  plane  (about  1.5°)  and  even  larger  values 
in  the  vertical  direction  (about  6°).  Their  data,  as  well  as  those  of  Saberi  et  al.  (1991a)  but  not  of 
Perrott  and  Saberi  (1990)  and  Saberi  and  Perrott  (1990),  are  consistent  with  the  hypothesis  that 
the compound LE (see section 5) is based on independent contributions of the horizontal and
vertical LEs. Further studies are needed to resolve this issue.

The  MAA  has  frequently  been  considered  to  be  the  smallest  attainable  precision  (difference 
limen)  in  absolute  sound  source  localization  in  space  (e.g.,  Hartmann,  1983a;  Hartmann  and 
Rakerd,  1989a;  Recanzone  et  al.,  1998).  However,  the  precision  of  absolute  localization 
judgments  observed  in  most  studies  is  generally  much  poorer  than  the  MAA  for  the  same  type  of 
sound stimulus. For example, the average error in absolute localization for a broadband sound
source is about 5° for the frontal and about 20° for the lateral position (Hofman and Van Opstal,
1998; Langendijk et al., 2001). Thus, it is possible that the MAA observed in these studies,
where  two  sounds  are  presented  in  succession,  and  the  precision  of  absolute  localization,  where 
only  a  single  sound  is  presented,  are  not  well  correlated  and  measure  two  different  human 
capabilities  (Moore  et  al.,  2008).  This  view  is  supported  by  results  from  animal  studies, 
indicating  that  some  types  of  lesions  in  the  brain  affect  the  precision  of  absolute  localization  but 
not the MAA (e.g., May, 2000; Young et al., 1992). In another set of studies, Spitzer and
colleagues (Spitzer et al., 2003; Spitzer and Takahashi, 2006) observed that barn owls exhibited
different  MAA  performance  in  anechoic  and  echoic  conditions  while  displaying  similar 
localization  precision  across  both  conditions.  The  explanation  of  these  differences  may  be  the 
difference  in  the  perceptual  tasks  and  the  much  greater  difficulty  of  the  absolute  localization 
task. In contrast, Recanzone et al. (1998) observed that absolute localization performance can be
predicted  from  the  slope  of  the  psychometric  function  obtained  in  the  MAA  experiment  (but  not 
from  the  MAA  value  itself). 

MAAs in synthetic environments (earphones with HRTF sound synthesis) are generally much
larger than those reported for the natural sound field. The horizontal virtual MAA at 0° azimuth
was reported to be on the order of 5°-10°, and the vertical virtual MAA on the order of 15°-35°
(McKinley and Ericson, 1997; Wersenyi, 2007). However, the size of virtual MAAs depends on
the  quality  of  the  rendered  space  and  the  type  of  earphones  used  for  the  study  (e.g.,  circumaural 
earphones  versus  insert  earphones). 


9.  Absolute  Localization 


Absolute  localization  is  the  identification  of  the  direction  of  an  incoming  sound  in  absolute 
terms,  i.e.,  without  using  a  previously  heard  sound  as  a  reference  point.  As  opposed  to  the 
localization  discrimination  task,  LE  in  the  absolute  localization  task  contains  both  error 
components: CE and RE (e.g., Hartmann, 1983a). Unfortunately, in many reports, LE is only
reported as either MUE or RMSE, and the relative contributions of CE and RE are typically not
reported (e.g., Wenzel et al., 1993; Wightman and Kistler, 1989b).

The sizes of CE and overall LE (CE together with RE) as reported in various absolute
localization  studies  differ  considerably  across  the  studies.  This  is  caused  by  the  fact  that  CE  is 
dependent  on  the  asymmetry  and  specific  character  of  the  surrounding  environment;  the 
asymmetry  of  the  listener’s  reception  mechanism;  and  the  asymmetry  of  any  potential  headgear 
worn  by  the  listener.  The  size  of  both  the  absolute  CE  and  RE  is  also  dependent  on  stimulus 
duration and its temporal (impulsive) properties (Iwaya et al., 2003; Pollack and Rose, 1967;
Roffler  and  Butler,  1968b).  These  effects  are  more  pronounced  in  the  vertical  than  horizontal 
plane.  In  addition,  both  CE  and  RE  strongly  depend  on  the  signal  type  used  in  the  study.  The 
largest  CEs  in  the  horizontal  and  vertical  planes  have  been  reported  for  pure  tones,  and  their  size 
decreases  with  increasing  bandwidth  and  complexity  of  the  emitted  sound  (Blauert,  1974/2001; 
Jacobsen,  1976).  In  contrast,  localization  acuity  does  not  seem  to  be  affected  by  the  temporal 
cues provided by amplitude modulation of the target sound (Eberle et al., 2000) and is well
maintained at various levels of sustained acceleration (Nelson et al., 2001).

In  general,  CE  is  the  smallest  for  frontal  positions  and  increases  with  sound  source  laterality  and 
elevation.  The  largest  localization  errors  have  been  observed  for  sound  source  positions  behind 
the listener’s interaural plane. For
frontal  positions  and  wideband  sounds,  the  reported  CEs  have  been  as  small  as  2°^°  in  azimuth 
and 3.5°^° in elevation (Bauer and Blackmer, 1965; Best et al., 2009; Carlile et al., 1997;
Makous  and  Middlebrooks,  1990;  Oldfield  and  Parker,  1984a;  Razavi,  2009)  and  as  large  as 
10°-15° in azimuth (Makous and Middlebrooks, 1990; Tiitinen et al., 2004; Tonning, 1970) and
15°-20° in elevation (Bauer and Blackmer, 1965; Tiitinen et al., 2004). For lateral horizontal
positions,  they  are  on  the  order  of  10°,  and  for  rear  horizontal  positions  they  can  be  as  large  as 
20°-25°  (e.g.,  Blauert,  1974/2001;  Oldfield  and  Parker,  1984a;  Rakerd  and  Hartmann,  1985). 
Savel  (2009)  observed  that  CE  in  the  horizontal  plane  (low  frequency  bands  of  noise,  50 
listeners)  has  a  tendency  to  increase  linearly  or  logarithmically  with  the  laterality  of  the  sound 
source.  She  also  reported  that  58%  of  her  listeners  showed  a  judgment  bias  toward  the  medial 
axis. Such an overwhelming bias toward the medial axis was also reported in earlier studies
(e.g., Sandel et al., 1955; Wells and Ross, 1980). Nevertheless, Savel (2009) also reported that
21%  of  her  listeners  demonstrated  bias  toward  the  interaural  axis.  Tonning  (1970)  and  Oldfield 
and  Parker  (1984a)  observed  that  CE  in  the  horizontal  plane  is  the  smallest  at  30°  and  330°  and 
the  largest  at  120°-160°  and  200°-240°  with  respect  to  the  listener’s  front.  Relatively  large  CEs 
at  120°  and  240°  in  both  the  horizontal  and  frontal  planes  were  also  reported  by  Wilska  (1938). 
That  the  smallest  CEs  were  observed  at  30°  and  330°  may  be  related  to  the  fact  that  in  this 
angular  range,  especially  at  50°  and  310°,  the  pinna  works  as  a  parabolic  reflector,  greatly 
amplifying  incoming  sounds  (Kuhn,  1987). 

Many  authors  reported  that  accuracy  of  absolute  localization  increases  with  the  increase  in  the 
signal  bandwidth  (e.g.,  Blauert,  1974/2001;  Burger,  1958;  Butler,  1986;  Middlebrooks,  1992). 
For example, Shigeno and Oyama (1983) compared localization accuracy of white noise pulses,
speech, and pure tone signals in the horizontal plane and reported the largest CE for pure tones
and the smallest for white noise pulses. For sound sources located close to the midline, the size
of the error also has a tendency to increase with frequency. For example, Pierce (1901) reported
an LE of 10° and 20° at 125 Hz and 2-5 kHz, respectively. This monotonic relationship between
LE and frequency does not hold, however, for lateral angles, that is, for sound sources not
located  on  the  median  plane  (e.g.,  Giguere  and  Abel,  1993).  In  general,  most  reports  indicate 
that  listeners  tend  to  overestimate  the  actual  lateral  position  of  sound  sources  located  at  angles 
larger  than  30°  by  5°-15°  in  both  natural  (Oldfield  and  Parker,  1984a)  and  virtual  environments 
(Carlile  et  al.,  1997;  Majdak  et  al.,  2010).  In  contrast,  Perrott  et  al.  (1987)  reported  that  their 
listeners  had  a  tendency  to  underestimate  the  lateral  positions  of  the  sound  sources.  However, 
they  assumed  that  this  tendency  was  the  result  of  the  specific  head  movement  response  (HMR) 
technique  used  in  their  study  rather  than  a  characteristic  property  of  the  auditory  spatial  function. 

Dobbins  and  Kindick  (1967)  studied  LE  in  the  horizontal  plane  in  a  jungle  environment  and 
observed  that  the  size  of  the  CE  varied  with  the  direction  toward  the  actual  sound  source  but  not 
as  much  with  the  type  of  the  emitted  sound  (pure  tones,  real-life  noises,  impact  and  impulse 
sounds).  The  errors  were  largest  for  lateral  angles  (25°-30°)  and  smallest  for  the  frontal  direction 
(15°-20°).  For  lateral  angles,  the  errors  tended  to  be  in  the  direction  of  the  closer  ear,  while  for 
sound  sources  located  in  the  frontal  plane,  they  tended  to  be  toward  the  front  of  the  listener. 
Caelli  and  Porter  (1980)  evaluated  drivers’  ability  to  determine  the  direction  of  an  incoming 
emergency  vehicle’s  hee-haw  siren  and  reported  a  CE  of  20°.  In  a  similar  study,  Bauer  (1953) 
investigated  listeners’  ability  to  determine  the  azimuth  of  the  approach  (or  departure)  of  a 
low-altitude-flying  and  invisible  (obscured  by  vegetation)  UH-1B  helicopter.  He  reported  an 
absolute  mean  localization  error  (MUE)  of  9°  for  experimental  conditions  in  which  the  sound  of 
the  helicopter  was  clearly  audible.  The  use  of  a  steel  helmet  (M1)  did  not  affect  localization 
precision. 

The  presence  of  noise  greatly  degrades  localization  performance  and  affects  the  directional 
detection  threshold  (DDT)  of  sound  arriving  from  an  unknown  direction  (Carhart  et  al.,  1969; 
Good  and  Gilkey,  1996;  Kock,  1950).  For  a  speech  sound  source  located  in  the  horizontal  plane, 
a  signal-to-noise  ratio  (SNR)  of  about  -9  to  -6  dB  is  needed  for  the  source  to  be  reliably  detected 
in  the  presence  of  diffuse  background  noise  and  localization  performance  increases  almost 
linearly  with  increasing  SNR,  reaching  its  maximum  resolution  at  SNRs  of  8-12  dB  (e.g., 
Abouchacra  et  al.,  1998a;  Abouchacra  and  Letowski,  2001;  Canevet,  1985;  Hirsh,  1950; 
Jacobsen,  1976).  Similar  data  were  reported  for  click  signals  (Good  and  Gilkey,  1996;  Lorenzi 
et  al.,  1999).  When  a  noise  sound  source  is  directional,  and  the  target  sound  and  noise  masker 
originate  from  different  locations,  the  DDT  improves  by  as  much  as  16-20  dB  for  non-speech 
and  13-15  dB  for  speech  targets  in  comparison  to  the  situation  when  the  locations  of  masker  and 
maskee  coincide  (Abouchacra  et  al.,  1996;  1998b;  Good  et  al.,  1997;  Saberi  et  al.,  1991b).  The 
spatial  distribution  of  the  target  and  distracting  speech  sound  sources  also  affects  speech 
intelligibility  of  the  target  speech  (cocktail  party  effect).  For  example,  Ricard  and  Meirs  (1994) 
reported  that  when  the  speech  signal  and  directional  masker  are  presented  from  different 
directions  (30°  or  more  apart)  in  an  AVR,  the  increase  in  speech  intelligibility  is  equivalent  to 
about  a  4-dB  improvement  in  SNR.  Special  tests  developed  to  measure  the  intelligibility  of  a 
target  speech  signal  produced  in  the  presence  of  other,  spatially  separated,  speech  sources 
include  the  Coordinate  Response  Measure  (CRM)  and  the  Synchronized  Sentence  Set  (S3)  tests 
(Abouchacra  et  al.,  2009;  2011;  Bolia  et  al.,  2000;  Ericson  and  McKinley,  1997;  Moore,  1981). 

Sounds  presented  simultaneously  with  or  shortly  preceding  the  target  signal  induce  both  bias  and 
variability  in  target  localization  response.  Such  sounds  were  once  considered  as  reference 
sounds  or  acoustic  cues  that  could  improve  localization  accuracy.  However,  the  opposite  is  true 
and  such  sounds  behave  as  distracters  (Kopco  and  Shinn-Cunningham,  2001).  The  perceived 
location  of  the  target  sound  source  may  be  either  “attracted  toward”  or  “repulsed  away  from”  the 
distracter  location  depending  on  the  position  of  both  sound  sources  and  the  stimuli 
characteristics,  such  as  frequency  (Butler  and  Naunton,  1962;  Good  and  Gilkey,  1996;  Kashino 
and  Nishida,  1996;  Kopco  et  al.,  2007).  It  is  also  possible  that  the  spatial  memory  of  a  short 
sound  produced  by  the  target  sound  source  within  the  background  of  other  spatialized  sound 
sources  may  be  biased  toward  the  center  of  the  quadrant  in  which  the  target  sound  source  was 
located,  an  effect  observed  in  visual  localization  studies  (e.g.,  Fitting  et  al.,  2007).  Abouchacra 
and  Letowski  (2001)  presented  simultaneous  speech  (target)  and  noise  (distracter)  sound  sources 
that  were  spatially  separated  and  observed  that  listeners  had  a  tendency  to  hear  the  speech  source 
as  coming  from  a  more  lateral  location  than  it  was  actually  in,  and  this  effect  was  observed 
regardless  of  the  position  of  the  noise  source.  The  opposite  shifts  toward  the  median  plane  were 
observed  in  earlier  studies  by  Sandel  et  al.  (1955)  and  Butler  et  al.  (1967).  Getzmann  (2003) 
examined  the  effect  of  distracters  on  the  localization  of  target  sound  sources  in  both  the 
horizontal  and  vertical  planes  and  reported  that  the  presence  of  a  distracter  caused  listeners  to 
shift  the  position  of  the  perceived  target  sound  source  away  from  the  distracter  in  both  planes, 
but  that  this  “contrast  effect”  was  stronger  in  the  vertical  plane.  Under  some  conditions,  the 
perceived  location  of  the  target  sound  source  is  shifted  away  from  the  position  of  the  distracter 
even  when  the  distracter’s  sound  terminates  prior  to  the  presentation  of  the  target  sound.  Similar 
tendencies  were  reported  by  Lorenzi  et  al.  (1999)  and  Kopco  et  al.  (2001).  In  contrast,  several 
other  authors  reported  the  listener’s  tendency  to  judge  the  location  of  the  sound  source  as  shifted 
toward  the  position  of  the  noise  source  (e.g.,  Good  and  Gilkey,  1996;  Good  et  al.,  1997; 
Langendijk  et  al.,  2001;  Massaro  et  al.,  1976).  These  contradicting  results  are  most  likely  due  to 
the  differences  in  the  number  of  positions  of  the  target  sound  source  used  in  various  studies  and 
the  similarity  between  the  target  sound  and  the  distracter. 

When  the  preceding  sound  (cueing  sound)  arrives  exactly  from  the  same  direction  as  the  future 
target  sound,  target  localization  performance  improves  in  comparison  to  the  no-cueing  control 
condition  (Braasch  and  Hartung,  2002;  Canevet  and  Meunier,  1994;  1996;  Carlile  et  al.,  2001; 
Thurlow  and  Jack,  1973b).  Getzmann  (2004)  and  several  previous  authors  attributed  this  effect 
to  the  presence  of  auditory  spatial  adaptation.  In  another  study,  Langendijk  et  al.  (2001)  observed 
that  target  sound  source  location  uncertainty  (LE)  increased  with  increasing  number  of 
distracters  (from  0  to  2)  and,  for  a  single  distracter,  with  decreasing  horizontal  angular  distance 
between  the  target  and  distracter.  The  increased  LE  was  due  to  both  types  of  errors:  increased 
uncertainty  of  the  target  location  (RE)  and  attraction  to  or  confusion  with  the  distracters  (CE). 
Though  the  angular  distance  between  the  target  and  distracter  had  no  systematic  effect  in  the 
vertical  dimension,  LEs  increased  substantially  for  target  elevations  exceeding  30°  when 
distracters  were  present. 

Several  authors  have  also  assessed  the  minimal  distance  that  can  be  distinguished  between  two 
sequentially  presented  sounds,  e.g.,  target  and  masker,  but  these  studies  belong  to  the  group  of 
studies  discussed  in  section  8. 

Absolute  localization  performance  in  the  vertical  plane  depends  on  the  presence  of  monaural 
(pinna)  cues,  sound  complexity,  and  high  frequency  spectral  sound  content  (Pierce,  1901;  Roffler 
and  Butler,  1968b).  Typical  localization  errors  are  on  the  order  of  4°  for  a  broadband  noise 
source  to  10°  for  a  speech  source  (Damaske  and  Wagener,  1969;  Gilkey  and  Anderson,  1995; 
Wettschurek,  1971).  They  can  be  as  large  as  15°-20°  for  pure  tones  and  narrowband  noises. 

The  size  of  the  error  increases  with  greater  vertical  deviations  from  the  horizontal  plane  as  well 
as  for  locations  behind  the  listener.  Strybel  et  al.  (1992a)  found  LE  in  the  vertical  plane  to  be 
largest  at  80°  of  elevation.  In  addition,  Davis  and  Stephens  (1974)  observed  that  the  vertical 
mean  absolute  error  (MUE)  decreased  monotonically  as  sound  intensity  increased  from  a  10-dB 
sensation  level  (SL)  to  a  70-dB  SL,  reaching  a  plateau  at  about  3.5°  at  approximately  50-60  dB  SL.  The 
differences  in  the  size  of  the  localization  error  between  a  20-  and  50-dB  SL  (experiment  1)  and  a 
10-  and  30-dB  SL  (experiment  2)  were  statistically  significant.  A  similar  dependency  between 
sound  intensity  and  localization  performance  was  reported  by  Altshuler  and  Comalli  (1975)  and 
Comalli  and  Altshuler  (1980). 

Short  impulse  sounds  (<  30  ms)  are  especially  poorly  localized  in  elevation,  which  leads  to  front-back 
overhead  confusions  (Hartmann  and  Rakerd,  1993)  and  a  general  negative  shift  in 
perceived  elevation  toward  the  horizontal  plane  (Best  et  al.,  2009;  Hofman  and  Van  Opstal, 
1998;  Macpherson  and  Middlebrooks,  2000;  Vliegen  and  Van  Opstal,  2004).  In  addition,  the 
magnitude  of  the  negative  CE  increases  with  increasing  signal  level  (Brungart  and  Simpson, 
2008;  Hartmann  and  Rakerd,  1993;  Macpherson  and  Middlebrooks,  2000),  although  Vliegen  and 
Van  Opstal  (2004)  reported  that  in  their  study  the  observed  CE  was  not  a  monotonic  function  of 
sound  level  and  was  lowest  for  a  40-50  dB  SL. 

Pedersen  and  Jorgensen  (2005)  reported  that  the  size  of  CE  in  the  median  plane  depends  on  the 
actual  sound  source  elevation  and  is  about  +3°  at  the  horizontal  plane,  0°  at  about  23°  elevation, 
and  becomes  negative  at  higher  elevations  (e.g.,  -3°  at  about  46°).  Conversely,  Oldfield  and 
Parker  (1984a)  observed  a  small  vertical  CE  (<±5°)  that  did  not  change  with  elevation. 
However,  it  was  slightly  affected  by  the  azimuth  of  the  sound  source,  tending  to  be  negative  in 
front  of  and  positive  behind  the  listener.  Best  et  al.  (2009)  conducted  a  meta-analysis  of  more 
than  50,000  localization  trials  using  data  collected  in  several  laboratories  and  concluded  that 
elevation  errors  are  (1)  biased  toward  the  horizontal  plane  when  the  sound  source  is  located  in 
the  frontal  hemisphere,  (2)  biased  forward  and  in  the  lateral  direction  for  sound  sources  located 
in  the  rear  hemisphere,  and  (3)  largest  for  sound  sources  located  overhead  and  slightly  behind  the 
listener.  Regardless  of  direction,  the  size  of  the  CE  is  independent  of  the  distance  from  the 
sound  source  and  is  similar  in  the  proximal  and  distal  regions  as  long  as  the  source  is  clearly 
audible  (Brungart,  1999).  The  size  of  CE  can,  however,  be  greatly  influenced  by  the  listener’s 
experience  (familiarity  with  the  sound  sources)  and  expectations  (Angell  and  Fite,  1901a;  1901b; 
Roffler  and  Butler,  1968a).  Both  Pratt  (1930)  and  Roffler  and  Butler  (1968a)  provided 
experimental  evidence  that  listeners  expect  low-pitch  sounds  to  be  generated  by  lower-placed 
sound  sources  than  high-pitch  sounds.  According  to  Pratt  (1930)  “...prior  to  any  associative 
addition  there  exist  in  every  tone  an  intrinsic  spatial  character  which  leads  directly  to  the 
recognition  of  differences  in  height  and  depth  along  the  pitch-continuum.” 

Harima  et  al.  (1997)  investigated  the  localization  of  virtual  sound  images  generated  by  two 
separate  sound  sources  located  in  the  median  plane.  They  reported  that  when  the  sound  sources 
were  located  in  front  of  the  listener,  the  sound  image  was  localized  at  about  halfway  between  the 
sources  as  long  as  the  separation  angle  did  not  exceed  45°.  The  sound  image  was  more  vague  at 
higher  elevations  and  in  the  rear  of  the  listener.  For  a  separation  angle  of  60°,  a  fused  image  was 
hardly  possible,  and  the  listeners  tended  to  localize  the  sound  image  higher  than  the  midpoint 
between  the  two  physical  sources. 

REs  in  the  absolute  localization  of  sound  sources  in  the  horizontal  and  vertical  planes  in  front  of 
the  listener  are  reported  to  be  4°-8°  and  6°-8°,  respectively  (Bronkhorst,  1995;  Pedersen  and 
Jorgensen,  2005),  although  they  can  be  as  small  as  1°-3°  for  sound  source  discrimination  tasks 
(Blauert,  1974/2001)  and  as  large  as  15°  in  virtual  environments  (e.g.,  Bergault  and  Wenzel, 
1993;  Majdak  et  al.,  2010).  The  size  of  RE  increases  slightly  with  sound  source  laterality  but  to 
a  lesser  degree  than  the  size  of  CE  (Perrott  et  al.,  1987).  The  poorest  precision  for  localization  in 
the  horizontal  plane  has  been  reported  for  angles  close  to  ±150°  (Tonning,  1970).  For  a  jungle 
environment,  Dobbins  and  Kindick  (1967)  reported  mean  REs  in  the  horizontal  plane  of  25°, 
20°,  and  15°  for  tones,  real-life  noises,  and  impact  and  impulse  sounds,  respectively.  These 
values  were  calculated  with  the  exclusion  of  all  reversal  (front-back)  errors  from  the  data  set  (see 
section  10).  When  the  reversal  errors  were  included  in  the  calculations,  the  mean  errors 
increased  to  39°,  29°,  and  23°,  respectively.  Similar  errors  (20°-30°)  were  reported  by  Eyring 
(1945)  for  the  localization  of  rifle  shooter  positions  in  a  jungle  environment.  In  general,  the  size 
of  RE  in  the  horizontal  plane  is  smaller  than  in  the  median  plane  for  frontal  locations,  but  this 
pattern  is  reversed  for  locations  in  the  rear  of  the  listener  (Makous  and  Middlebrooks,  1990; 
Carlile  et  al.,  1997).  In  all  the  studies  cited,  intrasubject  variability  of  the  data  was  much  smaller 
than  intersubject  variability. 

Oldfield  and  Parker  (1984a)  assessed  the  ability  of  listeners  to  localize  sound  sources  that  varied 
simultaneously  in  both  their  horizontal  and  vertical  location.  They  reported  MUE  of  9.1°  and 
8.2°  in  the  horizontal  and  vertical  planes,  respectively.  The  mean  horizontal  error  was  largest  at 
±130°-150°  and  reached  about  15°,  while  the  mean  vertical  error  was  relatively  independent  of 
azimuth.  Similar  studies  were  conducted  by  Wightman  and  Kistler  (1989b)  and  Makous  and 
Middlebrooks  (1990).  While  Wightman  and  Kistler  did  not  report  separate  horizontal  and 
vertical  errors  but  only  the  compound  error  (see  section  4),  Makous  and  Middlebrooks  reported  a 
range  of  MUEs  from  1.5°  in  the  horizontal  plane  and  3.5°  in  the  vertical  plane  for  frontal  sound 
source  positions  (0°  position  in  both  the  horizontal  and  vertical  plane)  to  15°-20°  at  certain 
combined  horizontal/vertical  locations.  The  respective  MEs  were  as  small  as  0°  and  -0.3°  and  as 
large  as  -13°  and  +17°  in  the  horizontal  and  vertical  planes,  respectively.  Standard  deviations 
(REs)  varied  from  as  little  as  2.0°  to  as  much  as  10.0°  in  the  horizontal  plane  and  increased  with  the 
degree  of  laterality.  In  the  vertical  plane,  they  were  on  the  order  of  4°  at  frontal  locations  and  as 
large  as  7°-8°  at  the  extreme  positive  and  negative  elevations.  The  data  were  screened  for  front-back 
errors  (see  section  10). 
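
The three error measures quoted in this section can be illustrated with a short calculation. The sketch below is only an illustration, assuming that the mean error (ME) is the mean of the signed errors, the mean unsigned error (MUE) is the mean of their absolute values, and the random error (RE) is their sample standard deviation. The function names and the response data are hypothetical and are not taken from any of the cited studies.

```python
import math

def signed_error(response_deg, target_deg):
    """Signed angular error in degrees, wrapped to the range [-180, 180)."""
    return (response_deg - target_deg + 180.0) % 360.0 - 180.0

def localization_stats(responses_deg, target_deg):
    """Return (ME, MUE, SD) for repeated judgments of one target direction.

    ME  : mean signed error (constant error, CE)
    MUE : mean unsigned (absolute) error
    SD  : sample standard deviation of the signed errors (random error, RE)
    Requires at least two responses.
    """
    errors = [signed_error(r, target_deg) for r in responses_deg]
    n = len(errors)
    me = sum(errors) / n
    mue = sum(abs(e) for e in errors) / n
    sd = math.sqrt(sum((e - me) ** 2 for e in errors) / (n - 1))
    return me, mue, sd

# Hypothetical judgments (degrees azimuth) of a source placed at 30 degrees.
me, mue, sd = localization_stats([28, 35, 31, 26, 40], 30)
print(f"ME = {me:.1f}, MUE = {mue:.1f}, RE (SD) = {sd:.1f}")
```

Wrapping the signed error keeps a response of 350° to a target at 10° from being counted as a 340° error. Whether reversal errors are removed before these statistics are computed must be decided separately, as the screening mentioned above indicates.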

The  effect  of  sound  reflections  and  room  reverberation  on  both  accuracy  and  precision  of 
localization  judgments  is  generally  detrimental  and  depends  on  the  space  geometry,  distribution 
and  strength  of  the  reflections,  and  the  position  of  the  listener  within  the  space  (Scharine,  2009; 
Scharine  and  Letowski,  2005;  Shinn-Cunningham  et  al.,  2005).  For  example,  Rakerd  and 
Hartmann  (1985)  and  Guski  (1990)  reported  that  ceiling  and  wall  reflections  are  more 
detrimental  for  auditory  localization  than  floor  reflections.  For  some  known  sound  sources,  such 
as  a  human  voice,  floor  reflections  may  even  help  in  improving  the  accuracy  (decreasing  CE)  of 
sound  source  localization  (Guski,  1990).  With  respect  to  the  type  of  emitted  sound,  the  negative 
effects  of  room  acoustics  are  the  strongest  for  narrowband  sounds  and  sounds  with  very  slow  rise 
times,  that  is,  missing  an  onset  time  cue  (e.g.,  Giguere  and  Abel,  1993;  Hartmann,  1983a;  Rakerd 
and  Hartmann,  1986). 

Since  in  natural  environments  the  sound  source  is  typically  facing  the  intended  listener,  sound 
sources  are  likewise  turned  toward  the  listener  in  almost  all  localization  studies.  However,  in 
many  practical  situations  the  sound  source  may  well  be  facing  in  a  different  direction.  This 
would  have  no  effect  on  the  listener  if  the  sound  source  were  non-directional,  but  this  is  rarely 
the  case,  and  no  sound  sources  can  be  considered  non-directional  at  wavelengths  that  are  much 
shorter  than  the  dimensions  of  the  sound  source  (Emanuel  and  Letowski,  2009).  Thus,  if  a  sound 
source  is  not  facing  the  listener,  the  ear  closer  to  the  main  direction  of  the  sound  radiation 
receives  a  relatively  stronger  signal  than  the  other  ear.  This  could  create  a  false  IID  cue  that  may 
result  in  a  noticeable  localization  CE  (Neuhoff  et  al.,  2001;  Tonning,  1970).  Such  errors  can  be 
observed,  for  example,  in  experiments  in  which  the  listener  faces  a  linear  horizontal  array  of 
loudspeakers  that  is  relatively  long  or  if  a  moving  sound  source  moves  at  different  oblique 
angles. 

The  use  of  earplugs,  earmuffs,  or  hearing  aids  also  affects  localization  performance.  In  general, 
in-the-ear  hearing  aids,  which  minimally  obstruct  the  pinna,  disturb  localization  cues  to  a  much 
lesser  degree  than  all  other  types  of  hearing  aids  (e.g.,  Leeuw  and  Dreschler,  1987;  Westermann 
and  Topholm,  1985).  Similarly,  earmuffs  are  more  detrimental  to  localization  acuity  than 
earplugs  (e.g.,  Abel  and  Hay,  1996).  Russell  and  Noble  (1976)  and  Noble  et  al.  (1990)  compared 
localization  judgments  made  with  open  ears,  earplugs,  and  earmuffs  and  observed  that  ear 
occlusion  resulted  in  rearward  CE  for  the  earplug  condition  and  frontward  CE  for  the  earmuff 
condition.  The  authors  concluded  that  the  ear  occlusion  created  false  localization  cues  (CE) 
rather  than  increasing  the  uncertainty  of  the  sound  source  location  (RE).  False  localization  cues 
can  also  be  produced  by  dynamic-range  compression  systems  and  limiters  used  in  some  hearing 
aids,  cochlear  implants,  and  military  tactical  communication  and  protection  systems  (TCAPSs)  if 
they  are  not  synchronized  at  the  two  ears  (e.g.,  Byrne  and  Noble,  1998).  Asynchronous 
compression  was  determined  to  affect  IID  (but  not  ITD)  by  several  authors,  but  its  effects  on 
localization  accuracy  were  practically  negligible  (e.g.,  Mussa-Shufani  et  al.,  2006;  Ricketts  et  al., 
2006;  Grantham  et  al.,  2008).  However,  Wiggins  and  Seeber  (2011)  reported  that  fast-acting 
asynchronous  compression  at  the  two  ears  significantly  affects  the  perceived  position  of  high-pass 
sounds  (fc  =  2000  Hz).  The  perceived  locations  of  sound  sources  producing  sounds  with 
abrupt  onset/offset  slopes  were  shifted  to  more  central  positions.  In  contrast,  sounds  with 
gradual  onset  and  offset  (such  as  speech)  were  heard  as  split  or  moving  images  with  increased 
lateral  shift  of  the  perceived  sound  source  positions.  The  severity  of  these  effects  can  be  reduced 
by  the  presence  of  low-frequency  ITD  cues  and  completely  eliminated  by  wireless 
synchronization  of  both  compression  systems  (e.g.,  Sockalingam  et  al.,  2009). 
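
The way independent compression at the two ears can distort the IID may be illustrated with a simple static model. The sketch below is a hypothetical example and not a model of any particular hearing aid or TCAPS: the hard-knee compressor, its 50-dB threshold and 3:1 ratio, and the input levels are all assumed values, and attack and release times are ignored. Synchronization of the two devices is represented simply by applying the same gain to both ears.

```python
def compressor_gain_db(level_db, threshold_db=50.0, ratio=3.0):
    """Static gain (in dB, never positive) of a hard-knee compressor.

    Threshold and ratio are assumed values chosen for illustration only.
    """
    if level_db <= threshold_db:
        return 0.0
    compressed = threshold_db + (level_db - threshold_db) / ratio
    return compressed - level_db

def output_iid_db(left_db, right_db, linked):
    """IID (left minus right, in dB) at the output of two ear-level compressors.

    linked=False: each ear computes its gain from its own input level.
    linked=True : both ears share the gain computed from the louder ear,
                  a stand-in for synchronized compression.
    """
    if linked:
        gain_left = gain_right = compressor_gain_db(max(left_db, right_db))
    else:
        gain_left = compressor_gain_db(left_db)
        gain_right = compressor_gain_db(right_db)
    return (left_db + gain_left) - (right_db + gain_right)

# Hypothetical source on the left: 70 dB at the left ear, 60 dB at the right.
print("input IID:       ", 70 - 60)
print("independent IID: ", round(output_iid_db(70, 60, linked=False), 2))
print("linked IID:      ", round(output_iid_db(70, 60, linked=True), 2))
```

Under these assumptions the independent compressors reduce the 10-dB input IID to about one third of its value, which would pull the image toward the center, while the shared gain leaves the IID unchanged.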

Localization  accuracy  in  headphone-based  AVRs  should  be  theoretically  comparable  to  the 
localization  accuracy  in  a  free-field  as  long  as  an  individualized  HRTF  is  accurately  measured 
and  reproduced.  However,  accurate  implementation  of  AVR  involves  a  number  of  acoustic 
compromises  that  have  made  this  goal  difficult  to  achieve  (Carlile,  1996),  and  absolute 
localization  in  virtual  environments  has  been  typically  found  to  be  less  accurate  than  absolute 
localization  in  real  environments,  even  if  individualized  HRTFs  are  used  (e.g.,  Bronkhorst,  1995; 
Hartmann  and  Wittenberg,  1996;  Wenzel,  1992).  For  example,  Wenzel  and  Foster  (1993), 
Middlebrooks  (1999ab),  and  Bergault  et  al.  (2001)  reported  a  CE  of  15°-25°  in  the  horizontal 
and  vertical  planes  using  individualized  HRTFs. 

The  perception  of  elevation  in  an  AVR  is  far  less  accurate 
than  in  a  free  sound  field,  with  a  large  number  of  listeners  perceiving  the  locations  of  virtual 
sound  sources  higher  than  intended  (Folds,  2006).  Pedersen  and  Jorgensen  (2005)  compared  the 
localization  of  real  and  virtual  sound  sources  in  the  horizontal  and  median  planes  and  reported 
REs  of  10°  and  14°  in  the  horizontal  plane  and  of  12°  and  24°  in  the  median  plane  for  real  and 
virtual  sources,  respectively.  Endsley  and  Rosiles  (1995)  reported  errors  reaching  35°  in  the 
horizontal  plane  and  50°  in  the  median  plane  at  large  elevations.  Bergault  and  Wenzel  (1993) 
reported  average  CE  of  17°  and  associated  RE  of  10°  for  sound  sources  in  the  vertical  plane 
presented  at  0°  elevation.  They  also  reported  an  average  overall  horizontal  LE  of  28°,  while 
Wightman  and  Kistler  (1989b)  reported  a  similar  error  of  20°  and  Endsley  and  Rosiles  (1995)  of 
35°. 

All  these  data  indicate  that  localization  performance  in  virtual  space  is  greatly  dependent  on  the 
precision  of  the  HRTF  measurement,  earphone  equalization,  the  type  of  signal,  the  listener’s 
movements,  and  the  accuracy  of  the  spatial  rendering  (e.g.,  King  and  Oldfield,  1997).  Therefore, 
localization  errors  can  only  be  discussed  keeping  the  specific  AVR  technologies  used  in  the 
given  study  in  mind. 


10.  Reversal  Localization  Errors 


10.1  Types  of  Reversal  Errors 

Reversal  errors  are  direction  estimates  of  the  sound  source  location  that  are  in  the  opposite 
direction  to  the  actual  sound  source  location.  They  occur  when  the  binaural  information 
correlates  equally  well  with  two  opposite  spatial  locations.  The  listener  points  not  at  the  sound 
source  but  at  its  mirror  image.  Such  errors  can  be  caused  by  sound  reflections  from  objects 
surrounding  the  listener,  the  presence  of  headgear  that  affects  the  sound  spectrum,  listener 
expectations,  or  interfering  effects  of  other  sounds  present  in  the  surrounding  environment. 
Reversal  errors  are  most  common  for  short  and  narrowband  sounds,  and  the  frequency  of  these 
errors  decreases  with  increasing  sound  duration  and  complexity  as  the  listener  is  able  to  use  head 
movements,  comprehend  the  spatial  scene,  and  combine  cues  across  a  range  of  sound 
frequencies.  However,  they  can  happen  in  any  environment  and  for  any  sound  source  under  the 
right  circumstances.  An  example  of  such  a  situation  was  reported  by  the  Baltimore  Sun 
(Hermann,  2011).  A  police  officer  searching  for  a  suspect  in  a  dark  area  was  accidentally  shot  in 
the  back  by  another  officer  that  was  following  him.  The  wounded  officer  thought  that  the  shot 
had  come  from  in  front  of  him  and  returned  fire  in  that  direction. 

In  general,  reversal  errors  can  be  front-back  (back-front),  left-right  (right-left),  or  up-down 
(down-up)  errors.  However,  the  presence  of  strong  binaural  localization  cues  in  humans 
practically  prevents  left-right  (right-left)  reversal  errors  from  occurring  (e.g.,  Bloch,  1893). 
Left-right  (right-left)  judgment  errors  are  made  only  because  of  sound  source  location  uncertainty 
when  the  sound  source  is  located  close  to  the  median  plane  of  the  listener  or  because  of  the  listener’s 
distraction.  Large  left-right  (right-left)  errors  are  very  infrequent  and  usually  constitute  less  than 
1%-2%  of  overall  localization  judgements  (e.g.,  Abel  and  Hay,  1996;  Smith-Abouchacra,  1993; 
Makous  and  Middlebrooks,  1990). 

There  is  continuing  debate  in  the  research  literature  as  to  what  exactly  constitutes  a  reversal 
error,  and  in  particular,  a  front-back  (back-front)  error.  Most  authors  define  front-back  errors  as 
any  estimates  that  cross  the  interaural  axis  (Carlile  et  al.,  1997;  Langendijk  et  al.,  2001;  Wenzel, 
1999).  Other  criteria  include  errors  crossing  the  interaural  axis  by  more  than  5°  (Jin,  2001),  10° 
(Schonstein,  2008)  or  15°  (Best  et  al.,  2009)  or  errors  that  are  within  a  certain  angle  after 
subtracting  180°.  An  example  of  the  last  case  is  using  a  ±10°  (e.g.,  Brungart  et  al.,  1999)  or  ±20° 
(e.g.,  Carlile  et  al.,  1997)  range  around  the  directly  opposite  angle  (position),  which  corresponds 
closely  to  the  range  of  typical  listener  uncertainty  in  the  frontal  direction. 
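
Because the choice of criterion changes which responses are counted as reversals, the two families of criteria described above can be compared in a short sketch. The code below is a hypothetical illustration, not the procedure of any cited study. It assumes azimuths in degrees with 0° in front of the listener and 180° behind, so that the interaural axis runs through 90° and 270°. It also assumes that the "directly opposite" position is the front-back mirror image of the target about the interaural axis (180° minus the target azimuth); for sources on the median plane this coincides with the diametrically opposite direction.

```python
def wrap180(angle_deg):
    """Wrap an angle in degrees to the range [-180, 180)."""
    return (angle_deg + 180.0) % 360.0 - 180.0

def crosses_interaural_axis(target_deg, response_deg, margin_deg=0.0):
    """True if the response lies in the opposite front/back hemisphere from
    the target and is more than margin_deg beyond the interaural axis."""
    t = abs(wrap180(target_deg))    # 0 = straight ahead, 180 = straight behind
    r = abs(wrap180(response_deg))
    if t < 90.0:                    # target in the frontal hemisphere
        return r - 90.0 > margin_deg
    if t > 90.0:                    # target in the rear hemisphere
        return 90.0 - r > margin_deg
    return False                    # target on the interaural axis itself

def near_mirror_image(target_deg, response_deg, tolerance_deg=10.0):
    """True if the response falls within +/- tolerance_deg of the front-back
    mirror image of the target (assumed here to be 180 - target azimuth)."""
    mirror_deg = 180.0 - target_deg
    return abs(wrap180(response_deg - mirror_deg)) <= tolerance_deg

# Hypothetical case: target at 20 degrees, response at 100 degrees.
print(crosses_interaural_axis(20, 100))                # axis-crossing criterion, no margin
print(crosses_interaural_axis(20, 100, margin_deg=15)) # axis-crossing with a 15-degree margin
print(near_mirror_image(20, 100, tolerance_deg=10))    # mirror-image window criterion
```

In this hypothetical case the response counts as a reversal only under the plain axis-crossing criterion, which shows why error rates from studies using different criteria are not directly comparable.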

10.2  Front-back  and  Back-front  Errors 

Front-back  (FB)  and  back-front  (BF)  errors  are  the  most  common  reversal  errors,  and  they 
happen  under  all  listening  conditions  but  are  fairly  rare  for  open  ear  conditions  and  relatively 
absorptive  environments  (Makous  and  Middlebrooks,  1990).  They  are  most  frequent  for  sound 
sources  located  on  or  near  the  median  plane,  narrowband  sounds,  and  sounds  spectrally  limited 
to  less  than  8  kHz  (Nakabayashi,  1974).  Typical  rates  of  FB/BF  errors  reported  in  auditory 
localization  studies  are  2%-12%  (Hollander,  1994;  Makous  and  Middlebrooks,  1990;  Oldfield 
and  Parker,  1984a;  Wenzel  et  al.,  1993;  Wightman  and  Kistler,  1989b).  For  example,  for  a 
wideband  sound  source  at  the  frontal  position,  Pedersen  and  Jorgensen  (2005)  reported  FB/BF 
error  rates  of  4.2%  and  9.1%  for  long  (2  s)  and  short  (250  ms)  stimuli,  respectively.  For  similar, 
relatively  long  wideband  stimuli,  Oldfield  and  Parker  (1984a)  and  Carlile  et  al.  (1997)  reported 
3.4%  and  3.2%  of  FB/BF  errors,  respectively,  although  in  the  first  study  BF  errors  dominated  FB 
errors  and  in  the  second  study  FB  errors  dominated  BF  errors.  FB/BF  errors  are  also  more 
common  for  speech  sound  sources  than  for  non-speech  wideband  sound  sources  (Gilkey  and 
Anderson,  1995). 

Usually,  FB  errors  dominate  BF  errors,  but  their  proportion  depends  on  the  visibility  of  the 
sound  sources,  the  type  of  listening  environment,  and  the  spectrum  of  the  emitted  sound  (e.g., 
Chasin  and  Chong,  1998).  However,  in  some  studies  the  number  of  reported  BF  errors  was 
greater  than  the  number  of  FB  errors  (e.g.,  Abouchacra  and  Letowski,  2001;  Moore,  2009). 
Abouchacra  and  Letowski  (2001)  used  speech  sounds  emitted  by  an  invisible  rotating 
loudspeaker  and  presented  in  either  non-directional  or  directional  background  noise.  In  both 
cases,  the  number  of  FB/BF  errors  was  dependent  on  the  SNR  and  was  largest  when  the  speech 
sound  source  was  located  at  ±135°. 

The  number  of  FB/BF  errors  usually  rapidly  decreases  with  increasing  high-frequency  energy 
content  in  the  signal.  This  is  due  mostly  to  the  increasing  role  of  monaural  spectral  cues  in 
perceiving  the  sound  source  location  (e.g.,  Giguere  and  Abel,  1993).  The  number  of  FB/BF 
errors  also  decreases  with  training  involving  signals  with  strong  monaural  cues  (Zahorik  et  al., 
2006). 

Stevens  and  Newman  (1936)  observed  that  the  number  of  FB  confusions  among  their  listeners 
was  much  greater  for  low  frequency  sound  sources  (below  2.5  kHz)  than  for  high-frequency 
sound  sources  (above  2.5  kHz).  Similarly,  large  numbers  of  FB/BF  confusions  for  warning  siren 
sounds  below  2.5  kHz  were  reported  by  Withington  (1999).  For  narrow,  one-octave  wide  noise 
bands,  Burger  (1958)  reported  a  20%  rate  of  FB/BF  confusions.  For  a  jungle  environment, 
Dobbins  and  Kindick  (1967)  reported  18%,  21%,  and  14%  FB/BF  error  rates  for  pure  tones,  real-life 
noises,  and  impact  and  impulse  sounds,  respectively.  An  exception  is  the  study  conducted  by 
Abel  and  Powlesland  (2010),  who  reported  a  rate  as  high  as  34%  for  FB/BF  confusions  for 
wideband  sound  sources  located  at  15°  above  or  below  the  interaural  line,  with  24%  being  FB 
errors  and  10%  being  BF  errors.  According  to  Kuhn  (1987),  the  effect  of  the  pinna  on 
localization  is  greatest  for  azimuths  of  about  50°  and  310°,  where  the  pinna  seems  to  act  as  a 
parabolic  reflector  and  greatly  differentiates  signals  coming  from  the  front  and  back.  Therefore, 
for  high  frequency  sound  sources  located  at  these  azimuths  the  number  of  FB/BF  errors  should 
be  the  smallest. 

Virtual environments tend to increase the number of front-back confusions, and reported rates of
FB/BF errors are in the range of 12%-20% for individualized HRTFs and 15%-35% for
non-individualized HRTFs (e.g., Bergault and Wenzel, 1993; Besing and Koehnke, 1995;
Bronkhorst, 1995; Pedersen and Jorgensen, 2005; Ricard and Meirs, 1994; Wenzel et al., 1993;
Wightman and Kistler, 1989b). For example, Wightman and Kistler (1989b) reported 5.6% and
11.0% rates of FB errors in a free-field and virtual field (individualized HRTFs), respectively,
for the same group of listeners. Similarly, Wenzel et al. (1991) reported 19% and 31% for the
same two conditions.

Typical  rates  of  FB  errors  in  virtual  environments  are  in  the  range  of  25%-35%  (Bergault  and 
Wenzel,  1993).  For  3-D  audio  presented  through  earphones,  Bergault  (1992)  reported  rates  of 
27.5%  for  dry  (anechoic-like)  and  33%  for  reverberant  synthetic  environments.  Similarly, 
Schonstein  et  al.  (2008)  reported  37.5%-52%  FB  error  rates  depending  on  the  type  of  earphone 
and  whether  the  frequency  response  was  equalized  or  not.  Wenzel  et  al.  (1993)  reported 
individual  error  rates  ranging  from  20%  to  43%  (32.0%  on  average;  25%  FB  and  6%  BF).  In 
another study, Wenzel (1999) reported rates of only 5.2%-8.8% for front-back and 11.3%-21.3%
for up-down confusions (including target locations close to the horizontal plane) and
26.2°-36.3° overall LEs. However, regardless of listening conditions, FB errors seem to be more
numerous  than  BF  errors,  and  they  are  most  common  for  horizontal  locations  close  to  0°  (Carlile 
et  al.,  1997).  For  example,  Bergault  and  Wenzel  (1993)  presented  speech  signals  over  earphones 
using non-individualized HRTFs and reported a 58% rate of reversal errors consisting of 47% of
FB and 11% of BF errors for 0° and 180° target sounds. The average reversal-corrected CE in
the horizontal plane was 24.6° for the 0° direction and 27.9° when averaged across all directions.
This latter error is only slightly larger than the average CE of 20.5° reported by Wightman and
Kistler (1989b) for subject SDO listening under similar conditions with her own HRTFs. There
are  also  large  individual  differences  in  the  case  of  non-individualized  HRTFs.  For  example, 
Ricard  and  Meirs  (1994)  reported  FB  error  rates  of  28.4%,  21.3%,  5.2%,  and  45.0%  for  their 
four  listeners. 

When  the  sound  source  is  located  at  lateral  positions,  practically  no  left-right  (right-left)  errors 
are observed, which provides evidence of the strength of binaural cues. Similarly, there are very
few up-down (down-up) real field errors reported in the literature that are outside the range of
uncertainty at the interaural plane, and there is no overall bias toward the upper or lower
hemisphere (e.g., Wenzel et al., 1993; Carlile et al., 1997). However, the number of up-down
errors in virtual space may be substantial depending on the quality of the simulation. It is
noteworthy that when they do occur, most up-down errors, in real or virtual space, are usually
associated with a simultaneous FB or BF error (Makous and Middlebrooks, 1990; Wenzel et al.,
1993).

The  importance  of  knowing  the  sources  of  FB/BF  errors  and  using  the  proper  signal  design  to 
minimize  their  occurrence  is  best  seen  in  studies  of  human  reactions  to  emergency  vehicle  sirens. 
Both  Caelli  and  Porter  (1980)  and  Withington  (1999)  reported  a  very  high  rate  of  FB  errors  in 
response  to  ambulance  sirens.  In  fact,  the  study  participants  were  more  often  wrong  than  right. 
These  data  were  collected  across  several  types  of  emergency  sirens  including  hee-haw,  pulsar, 
wailing,  and  whooping  sounds.  Withington  (1999)  concluded  that  complex  sounds  characterized 
by  pulses  of  rapidly  rising  frequency  sweeps  followed  by  bursts  of  wideband  noises  led  to 
improved  localization. 

10.3  Treatment  of  Reversal  Errors 

Some authors (e.g., Cabot, 1977; Gerzon, 1975; Oldfield and Parker, 1984a; Wightman and
Kistler, 1989b) eliminate reversal errors by mirroring the perceived reverse locations about the
interaural axis prior to data analysis in order to preserve the sample size. Such treatment of
reversal errors assumes that each error crossing the interaural axis consists of two components:
an actual error and a reversal component. Consider a sound source located at 40° and its perceived
location at 110°. Mirroring the response about the interaural axis (90°) gives 70°, so the actual
error is assumed to be 30° (shift from 40° to 70°) and the reversal component is equal to an
additional 40° (the difference between 110° and 70°). A rationale offered by other authors is
that in real-world environments, where visual cues interact with auditory cues, reversal errors
are much less likely to happen and therefore should not be taken into account and should be
eliminated from the data set. This may or may not be true, depending on the degree to which the
sound source is explicitly visible. However, some reversal errors can be due to listeners’
expectations, the presence of baffling headgear, reflections from the environment, the type of
sound source (see section 2.5), or simply a shift in the listener’s attention. Nevertheless, the
mirroring of the reversed estimates may in general decrease the size of the average localization
error in comparison to that obtained by discarding the reversed data points, but the actual
result depends on the specific distribution of the reversal errors.
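
The decomposition of a reversed response into an actual error and a reversal component can be illustrated with a short sketch. It is not taken from any of the cited studies; it assumes azimuths in degrees with 0° straight ahead, 90° on the interaural axis, and 180° directly behind, and the function names (`side`, `mirror_about_interaural_axis`, `split_reversal`) are hypothetical.

```python
import math


def side(azimuth_deg: float) -> int:
    """Return +1 for the frontal hemifield, -1 for the rear, 0 on the interaural axis."""
    c = math.cos(math.radians(azimuth_deg))
    if abs(c) < 1e-9:
        return 0
    return 1 if c > 0 else -1


def mirror_about_interaural_axis(azimuth_deg: float) -> float:
    """Reflect an azimuth across the 90°-270° (interaural) axis."""
    return (180.0 - azimuth_deg) % 360.0


def wrap(angle_deg: float) -> float:
    """Wrap an angular difference into the range [-180°, 180°)."""
    return (angle_deg + 180.0) % 360.0 - 180.0


def split_reversal(target_deg: float, response_deg: float):
    """Return (actual_error, reversal_component) in degrees.

    A response counts as a reversal only when target and response lie in
    opposite front/back hemifields (an assumption of this sketch).
    """
    if side(target_deg) * side(response_deg) < 0:
        mirrored = mirror_about_interaural_axis(response_deg)
        actual_error = wrap(mirrored - target_deg)
        reversal_component = abs(wrap(response_deg - mirrored))
        return actual_error, reversal_component
    return wrap(response_deg - target_deg), 0.0


# Example from the text: source at 40°, perceived at 110°.
print(split_reversal(40.0, 110.0))  # (30.0, 40.0)
```

Whether a response near the interaural axis should be mirrored at all is a separate decision that this sketch does not address.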

The  extraction  and  separate  analysis  of  FB  errors  should  not  be  confused  with  the  process  of 
trimming  the  data  set  to  remove  outliers,  even  though  both  processes  may  have  the  same 
practical  effect.  Reversal  errors  are  not  outliers  in  the  sense  that  they  simply  represent  extreme 
errors.  They  represent  a  different  type  of  error  that  has  a  different  underlying  cause  and,  as  such, 
should  be  treated  differently.  Any  remaining  errors  that  differ  more  than  2.5  SDs  from  the  mean 
may  be  trimmed  (discarded)  or  winsorized  to  keep  the  data  set  within  a  reasonable  range. 
Winsorizing  is  a  strategy  in  which  the  extreme  values  are  not  removed  from  the  sample,  but 
rather  are  replaced  with  the  maximal  remaining  values  on  either  side.  This  strategy  has  the 
advantage  of  not  reducing  the  sample  size  for  statistical  data  analysis.  Both  these  procedures 
mitigate  the  effects  of  extreme  values  and  are  a  way  of  making  the  resultant  sample  mean  and 
standard  deviation  more  robust. 
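
The difference between trimming and winsorizing can be shown with a minimal sketch. It assumes a single pass in which the mean and sample standard deviation are computed from the full data set (outliers included) and a 2.5 SD criterion is applied; the function names and the example error values are hypothetical.

```python
import statistics


def sd_limits(errors, k=2.5):
    """Return the (low, high) limits located k sample SDs from the mean."""
    mean = statistics.mean(errors)
    sd = statistics.stdev(errors)
    return mean - k * sd, mean + k * sd


def trim(errors, k=2.5):
    """Discard values lying more than k SDs from the mean."""
    low, high = sd_limits(errors, k)
    return [e for e in errors if low <= e <= high]


def winsorize(errors, k=2.5):
    """Replace extreme values with the most extreme remaining value on each side."""
    kept = trim(errors, k)
    lowest, highest = min(kept), max(kept)
    return [min(max(e, lowest), highest) for e in errors]


# Hypothetical localization errors (degrees) with one extreme value.
errors = [2, 3, 4, 5, 5, 6, 6, 7, 8, 60]
print(trim(errors))       # the 60° value is discarded; 9 values remain
print(winsorize(errors))  # the 60° value is replaced by 8°; 10 values remain
```

Trimming reduces the sample size, whereas winsorizing keeps it unchanged, which is the advantage noted above.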


11.  Categorical  Localization 


Another method of determining LE is to ask listeners to specify the sound source location by
selecting from a set of specifically labeled locations. These locations can be indicated by either
visible sound sources or special markers on the curtain covering the sound sources (Abel and
Banerjee, 1996; Butler et al., 1990; Giguere and Abel, 1993; Hammershoi and Sandvad, 1994;
Hawley et al., 1999). Such approaches restrict the number of possible directions to the
predetermined target locations and lead to categorical localization judgments (Perrett and
Noble, 1995). Hartmann et al. (1998) call categorical localization judgments
“source-identification method” judgments. The results of categorical localization studies are
normally expressed as percentages of correct responses rather than angular deviations. For
example, Bienvenue and Siegenthaler (1974) compared the binaural and monaural abilities of
listeners to distinguish between seven loudspeakers distributed around them as the source of a
projected speech signal and reported 97% and 52% correct response rates, respectively. Abel et
al. (2007) used eight loudspeakers unevenly distributed on a circle (two loudspeakers in each
spatial quadrant) in comparing various types of ear occlusions and reported 94% and 49% correct
response rates for open ears and ears covered with passive earmuffs (Peltor H10A), respectively.

Although categorical localization was the predominant localization methodology in older studies
(e.g., Bergman, 1957; Bienvenue and Siegenthaler, 1974; Kuyper and de Boer, 1969), it is still
commonly used today (Abel and Banerjee, 1996; Inoue, 2001; Macaulay et al., 2010; Van
Hoesel and Clark, 1999; Vause and Grantham, 1999). For example, the Source Azimuth
Identification in Noise Test (SAINT) uses categorical judgments with a circular array of 12
loudspeakers (Vermiglio et al., 1998), and a standard system for testing the localization ability
of cochlear implant users is categorical with 8 loudspeakers distributed in a symmetric manner in
the horizontal plane in front of the listener with 15.5° of separation (Tyler and Witt, 2004).


The main attraction of categorical localization studies lies in the fact that they are easy to
instrument and run. In many industrial and clinical settings the equipment that would allow
testing via a non-categorical paradigm may not even be available. However, in such a testing
paradigm, the angular distance between the labeled target locations may be de facto the
resolution of the localization judgments and may define the localization precision of the study.
With a small number of sound sources, the resolution is very poor, and with a large number of
sound sources, the numerous labels may confuse the listener since they could be hard to associate
with specific directions. Based on the analysis of loudspeaker arrays conducted by Hartmann et
al. (1998), it can be conjectured that the minimum number of sound sources in a loudspeaker
array spanning no more than 180° should be somewhat larger than 7-9. A further increase in the
number of loudspeakers within a given arc does not much affect the size of the localization error.
If the array exceeds 180°, the specific (asymmetrical) locations of the loudspeakers in the front
and back are more important than the actual number of loudspeakers in order to resolve FB and BF
confusions. Due to these limitations, a categorical localization paradigm is generally suitable for
sound source and headgear evaluation studies, and should generally not be used in research
investigating human abilities.

In  order  to  directly  compare  the  results  of  a  categorical  localization  study  to  an  absolute 
localization  study,  it  is  necessary  to  extract  a  mean  direction  and  standard  deviation  from  the 
distribution  of  responses  over  the  target  locations.  If  the  full  distribution  is  known,  then  by 
treating  each  response  as  an  indication  of  the  actual  angular  positions  of  the  selected  target 
location,  the  mean  and  standard  deviation  can  be  calculated  as  usual.  If  only  the  percent  of 
correct  responses  is  provided,  then  as  long  as  the  percent  correct  is  over  50%,  a  normal 
distribution  z-table  (giving  probabilities  of  a  result  being  less  than  a  given  z-score)  can  be  used  to 
estimate the standard deviation. If d is the angle of target separation (i.e., the angle between two
adjacent loudspeakers), p is the percent correct expressed as a proportion, and z is the z-score
corresponding to (p+1)/2, then the standard deviation is given by

$$\sigma = \frac{d}{2z} \qquad (45)$$

and the mean by the angular position of the correct target location. This is based on the
assumption that the correct responses are normally distributed over the range delimited by the
points half way between the correct loudspeaker and the two loudspeakers on either side. This
range spans the angle of target separation (d), and thus d/2 is the corresponding z-score for the
actual distribution. The relationship between the standard z-score, z, and the z-score for a normal
distribution N(μ,σ), denoted here by z_{N(μ,σ)}, is given by:

$$z_{N(\mu,\sigma)} = \mu + \sigma z \qquad (46)$$

In this case, the mean, μ, is 0 as the responses are centered on the correct loudspeaker position,
and z_{N(μ,σ)} = d/2, so solving for the standard deviation gives equation 45. As an example,
consider an array of loudspeakers separated by 15° and an 85% correct response rate for some
individual speaker. The z-score for (1 + 0.85)/2 = 0.925 is 1.44, so the standard deviation is
estimated to be 7.5°/1.44 = 5.2°.
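
The two conversions described above can be sketched in a few lines. This is only an illustration under the assumptions stated in the text (normally distributed responses centered on the correct loudspeaker, evenly spaced targets); the function names are hypothetical, and the full-distribution case uses ordinary linear statistics, which is reasonable only for arrays spanning a limited arc.

```python
import math
from statistics import NormalDist


def sd_from_percent_correct(p_correct: float, separation_deg: float) -> float:
    """Estimate the response SD (degrees) from the proportion of correct responses.

    Implements equation 45: sigma = d / (2 z), where z is the standard normal
    z-score corresponding to (p + 1) / 2. Requires 0.5 < p_correct < 1.
    """
    if not 0.5 < p_correct < 1.0:
        raise ValueError("p_correct must be between 0.5 and 1 (exclusive)")
    z = NormalDist().inv_cdf((1.0 + p_correct) / 2.0)
    return separation_deg / (2.0 * z)


def mean_sd_from_counts(angles_deg, counts):
    """Mean and sample SD when the full distribution of responses is known.

    Each response is treated as an indication of the angular position of the
    selected target location.
    """
    n = sum(counts)
    mean = sum(a * c for a, c in zip(angles_deg, counts)) / n
    var = sum(c * (a - mean) ** 2 for a, c in zip(angles_deg, counts)) / (n - 1)
    return mean, math.sqrt(var)


# Worked example from the text: 15° separation, 85% correct.
print(round(sd_from_percent_correct(0.85, 15.0), 1))  # 5.2
```

Because the z-score is computed numerically rather than read from a table, the result may differ from a hand calculation in the second decimal place.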

An  underlying  assumption  in  the  preceding  discussion  is  that  the  experimental  conditions  in  the 
categorical  task  are  such  that  the  listener  is  surrounded  by  evenly  spaced  target  locations.  If  this 
is  not  the  case,  then  the  results  for  the  extreme  locations  at  either  end  may  have  been  affected  by 
the  fact  that  there  are  no  further  locations.  In  particular  this  is  a  problem  when  the  location  with 
the  highest  percent  of  responses  is  not  the  correct  location  and  the  distribution  is  not  symmetric 
around it. For example, this appears to be the case for the speakers located at ±90° in the 30°
loudspeaker arrangement used by Abel and Banerjee (1996).

Categorical  judgments  are  also  used  in  some  vertical  localization  studies.  Davis  and  Stephens 
(1974)  argued  that  at  very  low  intensity  levels  vertical  localization  becomes  very  difficult  and  it 
is  much  easier  for  the  listeners  to  make  their  judgments  when  a  range  of  possible  locations  is 
provided  to  them. 


12.  Directional  Audiometry 


Human  localization  ability  has  received  a  great  deal  of  attention  from  the  research  community 
over  the  last  100  years,  but  there  have  only  been  limited  efforts  to  develop  audiological  tests  of 
this  ability,  as  it  was  not  clear  what  value  directional  hearing  tests  would  provide  for  clinical 
diagnostics.  Sound  source  localization  is  an  important  task  repeated  many  times  each  day,  but 
the  large  variability  of  localization  data,  even  for  people  with  the  same  type  of  hearing  or  type  of 
hearing loss etiology (e.g., Abel et al., 1978; Bocca et al., 1955; Nordlund, 1964; Wilmington et
al., 1994), as well as technical problems with creating spatially uniform testing conditions,
hampered  progress  in  developing  standardized  directional  hearing  tests.  In  addition,  the  obvious 
causes  of  localization  performance  degradation  related  to  the  mechanics  of  the  outer  and  middle 
ears  (e.g.,  atresia,  otitis  media,  otosclerosis)  can  be  determined  without  localization  tests,  so  there 
was  no  need  for  spatial  audiometric  tests  for  these  purposes.  Accordingly,  only  limited  sound 
field  tests  have  been  developed  for  testing  children  and  hearing  aid  evaluation  using  a  single 
loudspeaker  located  either  close  to  the  listener’s  ears  (<25  cm)  or  about  1  m  away  at  0°,  45° 
and/or  90°  angles  (e.g.,  ASHA,  1991;  Goldberg,  1979;  1981;  Walker  et  al.,  1984).  In  some  cases 
two  loudspeakers  were  used. 

Consistent  reports  by  people  who  are  hard  of  hearing,  as  well  as  by  some  normal  hearing 
listeners,  of  difficulties  in  localizing  sound  sources  outside  their  field  of  view  and 
comprehending speech coming from behind contributed in the end to efforts to develop
directional hearing tests for clinical practice. The clinical importance of LE was also supported
by the growing diagnostic value of sound lateralization tests conducted under earphones (e.g.,
Almqvist et al., 1989; Furst et al., 2000). Equally important, auditory localization tests also
became useful for testing the directional properties of hearing aids, especially those with
directional microphones, assistive technology, and, more recently, for testing the directional
hearing restoration of cochlear implant users.

Directional  hearing  tests  to  assess  hearing  deficiencies,  commonly  referred  to  in  the  medical 
community  as  directional  audiometry  or  spatial  audiometry,  seem  to  have  originated  from  the 
work of Goodhill (1954), Hahlbrock et al. (1959), Jongkees and Groen (1946), Jongkees and
Veer (1957), and Sanchez-Longo et al. (1957). In the early 1970s, Tonning published a series of
eight papers (Tonning, 1970, 1971ab, 1972abc, 1973ab) on the development and use of
directional hearing tests for audiological applications. It is noteworthy that six of Tonning’s
papers are related to directional speech intelligibility (DSI) testing and only two of them
(Tonning, 1970; 1973b) address localization issues. Other publications proposing some forms of
directional audiometry included Nordlund (1962ab, 1964), Link and Lehnhardt (1966),
Bienvenue and Siegenthaler (1974), Cook and Frank (1977), Newton and Hickson (1981), Zera et
al. (1982), Noble et al. (1994), and Besing et al. (1999b). In all these cases directional
audiometry  was  limited  to  sound  source  localization  in  the  horizontal  plane.  Over  the  years, 
three  basic  forms  of  directional  audiometry  testing  have  emerged: 

1.  The  listener  is  surrounded  by  loudspeakers,  and  the  loudspeakers  and  listener  both  remain 
stationary  (Abouchacra  et  al.,  1998a;  Bienvenue  and  Siegenthaler,  1974;  Cook  and  Frank, 
1977). 

2. The listener is surrounded by loudspeakers but rotates their chair toward the incoming
sound (Hahlbrock et al., 1959; Link and Lehnhardt, 1966).

3. A single loudspeaker rotates (or can be rotated) around a stationary listener (Elfner and
Howse, 1987; Newton and Hickson, 1981b; Nordlund, 1964; Sanchez-Longo et al., 1957;
Zabrewski, 1960).

Some  examples  of  the  technical  arrangements  used  in  all  three  forms  of  directional  audiometry 
testing are listed in table 3.

Table 3. Some examples of the technical arrangements used in directional audiometry tests. Test forms (a), (b), and (c) correspond to forms 1, 2, and 3 listed above.

| Author | Test form | Signal | Level | Comments |
|---|---|---|---|---|
| Sanchez-Longo et al. (1957) | (c) | 60-Hz tone with harmonics | 90 dB HL | 13 positions separated by 15° (±90° range); single loudspeaker hidden behind semicircular screen and moved by hand; T<10 s |
| Link and Lehnhardt (1966) | (b) | white noise | | 13 fixed loudspeakers separated by 10° (±60° range); response interval: 1.7° using a directional table; signal duration: short impulse |
| Hattori (1966) | (a) | narrow band noise (1 kHz) | | 3 fixed loudspeakers (0°, ±90°); phantom sound sources at ±5°, 10°, 15°, 20°, 25°, 40°, 50°, 60°, and 70° |
| Tonning (1970) | (c) | white noise | 65 dB SPL | 2 rotating loudspeakers; signal duration 10 s; response markers: 2.5° |
| Bienvenue and Siegenthaler (1974) | (a) | speech phrase “where is this” | 15 dB SL | 7 fixed loudspeakers mounted on the ceiling and separated by 45° (no front loudspeaker); response: loudspeaker number |
| Humes et al. (1980) | (a) | 0.5- and 3.0-kHz pure tones | 60 dB SPL | 13 loudspeakers separated by 15° (±90° range); response: loudspeaker number |
| Newton and Hickson (1981) | (c) | 0.5-kHz tone and narrow band noise (500 Hz) | | random angle in the 64° to 117° range; signal duration 5 s; response: spoken angular estimate |
| Vermiglio et al. (1998) (SAINT test) | (a) | various natural sounds | 55 dB A | 12 loudspeakers unequally distributed over 360°; 1 overhead loudspeaker for masker presentation (SNR = -5 dB) |
| Besing et al. (1999b) | (a) | speech phrases | 50 or 70 dB SPL | 2 loudspeakers; 9 spatialized phantom locations separated by 10° (±40° range) |
| Besing et al. (1999b) | (a) | speech phrases | 50 or 70 dB SPL | 8 loudspeakers separated by 20° (±70° range); no front loudspeaker |

Despite the several proposed testing configurations and data collection procedures, the clinical
community has not yet agreed on a single standard clinical procedure for evaluating directional
hearing. The unresolved issues include the technical requirements for the test system,
comprehensive yet flexible test procedures, and most importantly, normative data for directional
hearing. However, there is some progress toward standardization. For example, there seems to
be a consensus that the two best directional audiometry signals are low-pass (up to 0.5-1.0 kHz)
and high-pass (above 2-4 kHz) white noise signals that can separately test temporal and
intensity-based elements of spatial hearing. Another possibility is to use octave-wide bands of
noise. A method of delivering directional signals by a rotating loudspeaker has gained some
popularity (e.g., Abel et al., 1982; Comalli and Altshuler, 1980). As a clinical criterion of
normal localization ability (horizontal plane; frontal location), a localization accuracy of 10°
has been suggested (e.g., Comalli and Altshuler, 1980).

The  main  question  for  directional  audiometry,  however,  remains:  How  can  localization  data  be 
linked  to  specific  health  issues?  This  was  originally  a  question  without  a  clear  answer,  and  the 
view  that  the  relationship  between  hearing  loss  and  auditory  directional  sensitivity  was  only 
moderate was commonly held (e.g., Noble et al., 1994). However, by including listeners with a
variety of hearing disorders in localization studies, the research community is learning more and
more about the potential links between localization ability and specific hearing disorders. The
first  study  of  this  kind  seems  to  have  been  carried  out  by  Greene  (1929),  who  suggested  that 
lesions  in  the  temporal  lobe  may  be  related  to  poor  directional  hearing.  Similar  conclusions  were 
reached  by  Sanchez-Longo  et  al.  (1957)  and  Jerger  et  al.  (1969).  Degraded  directional  hearing 
has  also  been  reported  in  cases  of  otosclerosis  (Jongkees  and  Veer,  1957;  Newton  and  Hickson, 
1981;  Nordlund  1962ab,  1964)  and  acoustic  neuroma  (Abel  et  al.,  1982;  Liden  and  Korsan- 
Bengsten,  1973;  Newton  and  Hickson,  1981).  There  is  also  growing  evidence  that  directional 
audiometry  can  help  differentiate  between  most  cochlear  and  some  cortical  lesions,  and  lesions 
in  the  middle  ear,  cochlear  nerve,  and  retrocochlear  (points)  region.  The  former  cause  no 
directional  hearing  deficit,  whereas  the  latter  result  in  impaired  directional  hearing  (e.g., 
Nordlund,  1964;  Azzi,  1964).  Further,  changes  in  listeners’  localization  patterns  may  help  to 
differentiate  brain  lesions  at  the  SOC  and  IC  levels  (e.g.,  Aharonson  and  Furst,  2001)  (see 
section  4).  However,  it  is  still  unclear  to  what  degree  lesions  of  the  vestibular  system  affect 
directional  hearing  (Diamant,  1946;  Jongkees  and  Veer,  1958;  Nordlund,  1964;  Tonning,  1975). 
Since  the  listener  remains  stationary  during  sound  presentation  in  directional  audiometry,  it  is 
very  unlikely,  as  Blauert  (1974/2001)  points  out,  that  such  testing  can  reveal  any  disorder  of  the 
vestibular  system.  There  are,  however,  some  reports  indicating  that  very  strong  sounds  and  head 
vibrations  may  elicit  a  response  from  the  vestibular  system  even  if  the  person  remains  stationary 
(Parker et al., 1968; Parker and Gierke, 1971). A review of older literature on the effects of
hearing  disorders  on  directional  hearing  may  be  found  in  Durlach  et  al.  (1981). 

Both  the  research  and  clinical  communities  are  aware  that  some  of  the  differences  in  the  reported 
effects  of  lesion  site  on  patients’  auditory  localization  ability  may  be  due  to  superficial 
differences  in  the  acoustics  of  the  spaces  used  in  directional  audiometry  and  the  lack  of 
consistency  and  clarity  regarding  the  test  criteria.  For  example,  in  the  sound  field  studies  geared 
toward  the  development  of  directional  audiometry  tests,  a  head  restraint  should  be  used  in  order 
to  minimize  potential  contributions  of  dynamic  cues  that  can  confound  the  findings  (see 
section  2.3).  This  has  not  always  been  the  case  in  the  reported  studies.  Similarly,  some  authors 
reported  LE,  while  others  reported  CE  or  RE,  and  in  many  cases,  the  type  of  error  reported  by 
the authors was not clear. However, the type of LE made by a listener is very important in
clinical  evaluation.  According  to  Nordlund  (1962ab,  1964),  Newton  and  Hickson  (1981),  and 
Abel  et  al.  (1982),  abnormality  in  RE,  that  is,  a  greater  than  normal  inconsistency  of  localization 
responses,  constitutes  diagnostic  evidence  of  hearing  problems,  especially  of  sensorineural 
hearing  loss.  These  authors  also  argued  that  CE  toward  either  direction  has  little  diagnostic  value 
in determining the potential site of a lesion. In contrast, according to Abel et al. (1982), persons
with neuroma tend to make CEs by shifting the perceived image toward the unimpaired ear.
Such persons may also have problems distinguishing sound source positions on either side of the
median plane (Abel et al., 1982).

There  have  only  been  a  few  attempts  to  extend  directional  audiometry  to  vertical  localization. 
The first reported attempt was most likely made by Walsh (1957), who reported that in a number
of brainstem and cerebral lesion cases, horizontal localization ability remained intact while
vertical localization accuracy (CE) was noticeably affected. Interest in clinical testing of vertical
localization may increase with the development of virtual directional audiometry, which would
allow easy presentation of phantom sound sources from any angle in 3-D space (e.g., Bergault,
1992; Besing et al., 1999b). Such audiometry is based on synthetic out-of-the-head spatial audio
environments (AVR environments) presented through earphones (e.g., Abouchacra et al., 1998b;
Besing and Koehnke, 1995; Besing et al., 1999a; Koehnke and Besing, 1996; 1997ab; Vermiglio
et al., 1998). The tests proposed by various authors include DDT tests, localization accuracy
tests, and speech-in-noise (cocktail party effect) tests. Speech-in-noise tests include both
directional and ambient noise maskers (e.g., Abouchacra and Letowski, 2004; 2005; Abouchacra
et al., 2009). Tests for both adult and child populations have been proposed (e.g., Besing
et al., 1998). Some common elements of the virtual directional audiometry tests proposed to date
include the use of speech test signals and out-of-the-head phantom sound source locations
separated by 22.5°. Note that previous attempts to use earphones without virtual out-of-the-head
spatialization failed due to in-the-head localization, which is both unnatural and inaccurate in
resolving phantom sound source locations (e.g., Nordlund, 1962b). Further improvements in the
standardization of directional audiometry may result from the standardization efforts of the
American National Standards Institute (ANSI), which established two working groups,
S3/WG83 and S3/WG89, to evaluate the feasibility of natural and virtual directional audiometry,
and develop unified procedures for directional hearing tests in both real and virtual spaces.

In  the  only  study  of  its  kind  to  date,  Vermiglio  et  al.  (1998)  compared  real  sound  field 
(loudspeakers;  eight  sources)  and  virtual  sound  field  (earphones;  six  sources)  versions  of  their 
SAINT  test  (see  table  3)  and  concluded  that  although  the  headphone  test  was  less  sensitive  than 
the loudspeaker test (with the difference attributed to the smaller number of sound sources), both
tests  demonstrated  similar  test-retest  reliability.  This  is  an  important  finding  since  regardless  of 
the  advances  of  virtual  earphone-based  directional  audiometry,  free-field  audiometry  will  always 
be  required  for  testing  the  effects  of  hearing  aids,  hearing  protectors,  and  other  headgear  on 
people’s  ability  to  identify  the  direction  of  incoming  sounds. 


13.  Localization  of  Multiple  Sound  Sources 


Most  auditory  localization  studies  to  date  have  focused  on  the  localization  of  a  single  sound 
source  either  in  isolation  or  with  a  more  or  less  complex  acoustic  background  environment  (see 
section  9).  However,  our  daily  listening  situations  are  much  more  complex  than  those  and  can 
require  that  we  pay  attention  to  more  than  one  sound  source  at  a  time.  For  example,  a  blind 
person  walking  in  the  street  must  pay  attention  to  several  sound  sources  in  order  to  walk  safely 
and  effectively.  While  selective  attention  tasks  where  the  listener  focuses  on  a  specific  sound 
source are well researched in the psychoacoustic literature, divided attention tasks are not often
addressed, and very few studies to date considered situations in which a listener had to identify
and/or localize two or more simultaneously active sound sources.

The  simultaneous  localization  of  two  or  more  sound  sources  located  at  different  positions  in 
space is a very demanding task, especially if there is complete or even partial overlap between the
spectral  and  temporal  patterns  of  the  emitted  sounds.  When  sounds  produced  by  two  or  more 
sound  sources  have  similar  sound  onset  and  harmonic  structure,  they  may  be  fused  into  one  event 
with  a  single  real  or  virtual  source  of  origin.  This  fusion  effect  results  from  the  rules  of  auditory 
scene  analysis  (ASA)  performed  by  the  listener’s  central  auditory  system  (Bregman,  1990; 
Shinn-Cunningham  and  Durlach,  1994;  Woods  and  Colburn,  1992).  One  common  example  of 
such  a  fusion  effect  is  the  precedence  effect  (see  section  2.4).  In  general,  if  two  or  more  sound 
sources  are  synchronously  presenting  similar  (e.g.,  harmonically  related)  sounds  from  different 
locations  in  space,  their  timing  serves  as  a  grouping  cue  and  only  the  location  of  the  lowest 
frequency sound is perceived if all sounds arrive at the same time (Best et al., 2007). Therefore,
to facilitate localization of two or more simultaneously active sound sources located at the same
distance from the listener, the sources have to be both well separated in space and emit
sounds that are easy for the listener to distinguish (Bregman, 1990).

The  first  attempt  to  measure  a  threshold  for  distinguishing  between  the  locations  of  two 
concurrently  active  sound  sources  was  reported  by  Perrott  (1984a),  who  referred  to  this  threshold 
as  the  concurrent  minimum  audible  angle  (CMAA).  Perrott  presented  two  simultaneous  tones  of 
different pitch from two sound sources located in the horizontal plane and asked listeners to
report  if  the  higher  tone  was  located  to  the  left  or  to  the  right  of  the  lower  tone  (Perrott,  1984a). 
The CMAA values reported for a 75% correct identification rate varied from 5°-10° at the
frontal location to as much as 30°-45° for a lateral azimuth of 67°. Similar data were reported
by  Divenyi  and  Oliver  (1989)  for  amplitude-  and  frequency-modulated  tones  and  Best  et  al. 
(2004)  for  broadband  sounds.  Results  of  all  these  studies  indicate  that  pitch  similarity  and 
spectral  overlap  decrease  the  resolution  of  concurrent  sounds  and  increase  the  CMAA  value. 

Hollander  (1994)  measured  the  CMAA  at  the  frontal  direction  using  harmonic  complexes  that 
differed  in  their  fundamental  frequency  (1000  and  1050  Hz)  and  reported  much  poorer  spatial 
resolution  than  Perrott  (1984ab)  and  Divenyi  and  Oliver  (1989).  He  also  observed  large 
intersubject  variability  in  the  results.  Among  the  seven  listeners  in  the  study,  horizontal  and 
vertical  CMAAs  varied  from  20°  to  60°  and  20°  to  80°,  respectively.  Best  et  al.  (2003)  modified 
the  CMAA  paradigm  by  presenting  two  identical  broadband  sounds  from  two  loudspeakers  and 
asking  listeners  whether  they  heard  the  sound  as  coming  from  a  single  location  or  from  two 
distinct  locations  either  in  azimuth  or  elevation.  The  spatial  resolution  data  in  the  horizontal 
plane  were  poorer  but  qualitatively  similar  to  those  obtained  by  Perrott  (1984ab)  and  Divenyi 
and  Oliver  (1989).  The  source  separation  needed  to  spatially  resolve  two  sources  was  location- 
dependent  and  varied  from  21°  in  front  of  the  listener  to  about  45°  at  a  90°  lateral  angle.  For  two 
concurrent sound sources located at different elevations, listeners were practically unable to
separate them in the median plane but could discriminate between them at lateral angles, e.g., at
the frontal plane, when the angular separation exceeded 50°-60°.

The  first  experiment  (that  we  are  aware  of)  involving  the  simultaneous  localization  of  several 
well-separated  sound  sources  was  reported  by  Rowell  and  Kay  (1968),  who  presented 
blindfolded  listeners  with  five  different  sound  sources  (loudspeakers)  emitting  the  same  signal 
and  asked  them  to  identify  the  number  and  location  of  the  sound  sources.  Obviously,  this  was  an 
impossible  task  as  (in  effect)  the  listeners  only  heard  one  sound  coming  from  a  single  location 
that  changed  as  they  moved  in  the  space.  The  purpose  of  this  experiment  was  to  prove  that 
multiple  simultaneously  active  sound  sources  must  have  very  different  characteristics  to  be  heard 
separately. 

Parmentier  and  Jones  (2000)  conducted  a  study  in  which  listeners  were  asked  to  remember  and 
recall  a  sequence  of  sounds  presented  in  random  order  by  nine  loudspeakers  placed  at  40° 
intervals  around  the  listener.  The  authors  reported  the  presence  of  primacy  and  recency  effects, 
resulting  in  a  large  number  of  errors  in  which  listeners  erroneously  selected  the  loudspeaker  that 
had  emitted  the  preceding  sound  instead  of  the  loudspeaker  emitting  the  current  sound.  In 
contrast,  very  few  spatial  errors,  that  is,  the  selection  of  an  adjacent  loudspeaker  instead  of  the 
correct  one,  were  reported.  In  a  similar  study,  Klatzky  et  al.  (2002)  presented  three  or  five  words 
in  sequence  from  three  or  five  loudspeakers  placed  at  least  30°  apart.  Each  word  was  presented 
through  a  specific  loudspeaker,  and  the  listeners’  task  was  to  associate  specific  words  with 
specific  sound  sources.  The  authors  reported  that  the  listeners  learned  the  task  more  quickly  for 
three  than  five  spatially  separated  word/loudspeaker  combinations. 

The  first  study  in  which  listeners  were  actually  asked  to  simultaneously  localize  multiple  sources 
concurrently  presenting  different  sounds  was  done  by  Brungart  et  al.  (2005).  Listeners  were 
asked  to  localize  the  sources  of  up  to  14  different  broadband  continuous  noises.  The  individual 
sources  were  turned  on  in  sequence,  and  each  time  a  new  source  was  added  the  listener  was 
asked  to  identify  its  location.  Localization  accuracy  declined  steadily  with  increasing  number  of 
active  sound  sources  but  remained  higher  than  chance  even  when  all  14  sound  sources  had  been 
turned  on.  Head  movements  were  found  to  be  helpful  in  the  localization  task  for  up  to  five  active 
sound  sources  but  not  beyond  that  level. 

A  group  of  concurrent  sound  sources  was  also  used  in  the  studies  by  Simpson  et  al.  (2007)  and 
Santala and Pulkki (2011). Simpson et al. presented n concurrent non-speech sounds and then
eliminated  one  of  the  sources  and  asked  the  listeners  to  indicate  where  the  eliminated  sound 
source  had  been  located.  The  LE  was  on  the  order  of  5°  for  n  =  2,  10°  for  n  =  4,  25°  for  n  =  6, 
and  35°  for  n  =  8  (for  the  sounds  and  conditions  used  in  the  study).  Santala  and  Pulkki  presented 
uncorrelated  pink  noise  bursts  through  groups  of  1  to  13  loudspeakers  (1,  2,  3,  4,  5,  7,  11,  13) 
distributed  in  the  frontal  horizontal  plane  and  asked  their  listeners  to  identify  all  the  loudspeakers 
emitting  the  sound  at  the  given  time.  The  general  conclusion  that  emerged  from  the  study  was 
that the listeners were unable to identify the spatial details of the sound field when there were
more than three loudspeakers emitting sound concurrently. Note that in this study, the listeners’
task  involved  focusing  on  all  sound  sources  simultaneously  rather  than  on  one  of  them  at  a  time 
as  in  the  Brungart  et  al.  (2005)  and  Simpson  et  al.  (2007)  studies. 

Martin et al. (2011) presented listeners with up to six sources of environmental sounds
positioned  around  the  listener  in  an  AVR  space.  The  sequence  of  1  to  6  sounds  was  presented  1, 
3,  or  5  times,  and  the  target  sound  was  revealed  after  the  presentation  of  the  last  sequence.  The 
listener’s  task  was  to  identify  the  location  of  the  sound  source  that  produced  this  sound.  As  in 
the previous studies mentioned, pronounced primacy and recency effects were found. Further
research  in  this  area  is  needed  to  determine  the  human  ability  to  localize  two  or  more  sound 
sources  that  simultaneously  (or  within  a  short  time  frame)  produce  sounds  of  short  duration  and 
to  determine  the  limitations  of  spatial  auditory  attention  in  construing  auditory  awareness  of  the 
surrounding  environment. 


14.  Perception  of  Moving  Sound  Sources 


Our  ability  to  perceive  motion  is  very  important  in  our  ongoing  interactions  with  the  surrounding 
world  and  is  the  key  to  our  ability  to  detect  and  avoid  threats.  Both  the  visual  and  auditory 
senses  can  detect  and  monitor  the  motion  of  objects  moving  along  various  trajectories  if  their 
motion  is  relatively  slow  (Stern  et  al.,  2006).  A  person  can  discriminate  direction  of  motion, 
estimate  distance  travelled,  and  assess  velocity  of  the  tracked  object.  In  addition,  tracked  objects 
can  rotate  (turn  to  the  left  or  right),  tilt  (pivot)  toward  one  side  or  the  other,  and/or  tumble  (turn 
up  or  down),  that  is,  make  changes  in  their  relative  yaw,  roll,  and  pitch,  each  of  which  can  affect 
both  senses  of  motion  perception. 

The  two  main  cues  that  enable  a  listener  to  track  the  direction  of  a  moving  sound  source  are 
angular  velocity  and  radial  velocity  cues.  Other  variables  affecting  perception  of  movement 
include  distance  from  the  listener,  Doppler  frequency  shift,  sound  intensity,  and  interaural 
differences  (e.g.,  Ericson,  2000;  Rosenblum  et  al.,  1987).  Angular  velocity  is  the  velocity  at 
which  the  sound  source  rotates  around  the  listener,  while  radial  velocity  is  the  velocity  at  which 
it  moves  toward  or  away  from  the  listener.  Movement  of  the  sound  source  toward  (or  away  from) 
the  listener  causes  changes  in  the  sound  intensity  perceived  by  the  listener  and  produces  a 
frequency  shift  in  the  perceived  spectrum  of  the  moving  sound  due  to  the  Doppler  effect.  Sound 
waveforms  produced  by  a  sound  source  moving  toward  the  listener  become  compressed  along 
the  axis  of  movement,  which  results  in  a  higher  effective  sound  frequency.  When  a  sound  source 
moves  away  from  the  listener,  the  effect  is  reversed,  and  the  effective  sound  frequency  is  lower. 
Mathematically,  the  effect  is  expressed  as  follows: 


$$ f_o = \frac{c + v_o}{c + v_s}\, f_s \qquad (47) $$

where $f_o$ is the frequency of the sound as heard by the listener, $f_s$ the frequency of the sound
produced by the sound source, $c$ the velocity of sound in the medium, $v_s$ the velocity of the sound
source relative to the medium, and $v_o$ the velocity of the observer relative to the medium. The
convention regarding positive (+) and negative (-) directions of movement as used in equation
47 is shown in figure 16.

Figure 16. Positive (+) and negative (-) directions of movement
used in the formula.
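
As an illustration only, the Doppler relation of equation 47 can be sketched in Python. The sketch assumes the common form f_o = f_s (c + v_o)/(c + v_s), with v_o taken as positive when the observer moves toward the source and v_s taken as positive when the source moves away from the observer. This sign convention, the function name doppler_frequency, and the default sound speed of 343 m/s (air at about 20 °C) are assumptions of the sketch and are not taken from figure 16.

```python
def doppler_frequency(f_s, v_o=0.0, v_s=0.0, c=343.0):
    """Observed frequency for a moving source and/or observer.

    Assumed sign convention (hypothetical, chosen for this sketch):
      v_o > 0  observer moves toward the source
      v_s > 0  source moves away from the observer
    All velocities are in m/s and are measured relative to the medium.
    """
    if c + v_s <= 0:
        raise ValueError("source speed toward the observer must stay below c")
    return f_s * (c + v_o) / (c + v_s)


# A 1000-Hz source moving at 20 m/s toward, then away from, a stationary listener.
approaching = doppler_frequency(1000.0, v_s=-20.0)
receding = doppler_frequency(1000.0, v_s=20.0)
print(f"approaching: {approaching:.1f} Hz, receding: {receding:.1f} Hz")
```

Under these assumptions the approaching source is heard above 1000 Hz and the receding source below it, in line with the qualitative description given above.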


The Doppler effect causes a semitone shift in the sound spectrum for each change of 42 mph (67.6
km/h) in relative velocity. When a sound source travels along a trajectory that does not intersect
the listener’s location, the radial velocity ($v_{obs}$) of the source varies as a function of the angle, $\alpha$,
between the direction of the source’s velocity ($v_{source}$) and the line connecting the source with the
listener as

$$ v_{obs} = v_{source} \cos\alpha \qquad (48) $$
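
The two statements above, roughly one semitone per 42 mph of relative velocity and the cosine dependence of radial velocity in equation 48, can be checked with a short Python sketch. The function names, the sound speed of 343 m/s, and the choice of a stationary listener with an approaching source are assumptions made for illustration; the exact velocity per semitone depends on the sound speed and on whether the source or the listener is moving.

```python
import math

MPH_TO_M_PER_S = 0.44704  # exact definition of one mile per hour in m/s


def radial_velocity(v_source, alpha_deg):
    """Equation 48: velocity component along the source-listener line."""
    return v_source * math.cos(math.radians(alpha_deg))


def shift_in_semitones(v_radial, c=343.0):
    """Pitch shift heard by a stationary listener, in semitones.

    v_radial is the closing speed in m/s (positive when the source approaches);
    c is an assumed speed of sound in air.
    """
    if v_radial >= c:
        raise ValueError("closing speed must stay below the speed of sound")
    return 12.0 * math.log2(c / (c - v_radial))


speed = 42 * MPH_TO_M_PER_S  # 42 mph expressed in m/s
head_on = shift_in_semitones(radial_velocity(speed, 0.0))
oblique = shift_in_semitones(radial_velocity(speed, 60.0))
print(f"alpha = 0 deg: {head_on:.2f} semitones; alpha = 60 deg: {oblique:.2f} semitones")
```

With these assumptions the head-on case comes out close to one semitone, and at an angle of 60° the shift is roughly halved because only half of the source velocity is directed toward the listener.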


Note,  however,  that  some  authors  (e.g.,  Laroche,  1994)  erroneously  state  that  the  Doppler 
frequency  increases  as  the  source,  moving  at  constant  speed,  approaches  an  observer  and  then 
decreases  as  it  passes  the  observer.  In  reality,  as  the  sound  source  approaches  the  listener,  the 
Doppler frequency is higher than the emitted frequency but does not change until the sound
source passes the listener, at which point it drops to a value below the emitted frequency and
again  does  not  change  as  it  moves  away  from  the  listener.  Bohren  (1991)  argued  that  the 
perception  of  increasing  frequency  for  the  approaching  sound  source  is  the  effect  of  increasing 
sound  intensity  as  the  sound  source  nears  the  listener,  which  is  misinterpreted  as  an  increase  in 
signal  frequency. 

Although  changes  in  distance  may  be  cued  by  sound  intensity  differences  or  by  the  Doppler  shift 
in  sound  frequency,  changes  in  vertical  and  horizontal  angle  are  cued  by  binaural  and  monaural 
localization  cues.  The  primary  metric  used  in  reporting  perceived  sound  source  motion  is  the 
minimum  audible  movement  angle  (MAMA).  The  MAMA  is  defined  as  the  smallest  angular 
distance  the  sound  source  has  to  travel  so  that  its  direction  of  motion  is  detected.  In  other  words, 
the  MAMA  is  the  detection  threshold  for  movement,  whereas  the  MAA  is  the  detection  threshold 
for  location.  The  MAMA  is  usually  larger  than  the  MAA,  typically  twice  as  large,  for  the  same 
sound  source  and  the  same  initial  (reference)  direction  and  is  independent  of  direction  of 
movement in the horizontal plane (e.g., Chandler and Grantham, 1992; Grantham et al., 2003;
Perrott and Musicant, 1977) and signal intensity (Perrott and Marlborough, 1989). Similarly to
the MAA, the MAMA is smallest in front of the listener and increases as the sound source moves
away from the listener laterally (Harris and Sergeant, 1971; Grantham, 1986); is smaller for
wide- than narrow-band stimuli (Harris and Sergeant, 1971; Saberi and Perrott, 1990); and is
largest  in  the  mid-high  frequency  range  (Perrott  and  Tucker,  1988).  In  general,  MAMAs  are 
U-shaped  functions  of  velocity,  with  optimum  resolution  obtained  at  about  8°-16°/s  in  the 
horizontal  plane  and  7°-10°/s  in  the  vertical  plane  (Saberi  and  Perrott,  1990). 

At  low  horizontal  angular  velocities  (below  20°/s),  the  MAMA  at  the  midline  (0°)  is  relatively 
small (on the order of 2°-8°; Perrott and Marlborough [1989] reported 1°) but becomes larger
(10°-20°) as velocity increases (Carlile and Best, 2002; Chandler and Grantham, 1992; Harris
and Sergeant, 1971; Perrott, 1982; Perrott et al., 1993; Saberi and Perrott, 1990). Grantham
(1997)  reported  MAMAs  of  4.8°  and  7.8°  at  velocities  of  20°/s  and  60°/s,  respectively.  A 
MAMA  of  20°  was  also  reported  for  a  velocity  of  180°/s  by  Chandler  and  Grantham  (1992)  and 
for  a  velocity  of  360°/s  by  Grantham  (1986)  and  Perrott  and  Musicant  (1977).  Strybel  et  al. 
(1992b)  reported  that  at  a  velocity  of  20°/s,  the  initial  position  of  the  sound  source  did  not 
significantly affect the MAMA for azimuth locations in the ±40° range. Within this range, and at
elevations below 80°, the MAMAs were surprisingly small (1°-2°) but increased to 3°-10° outside
of this range. However, Chandler and Grantham (1992) reported MAMAs being 1.5 to 3.0 times
larger at a 60° azimuth than at the midline (0°). Some variability in the reported data may be
caused by the degree of spatial adaptation to the initial position of the subsequently moving
sound source available to the listener (Getzmann and Lewald, 2011).

For  sound  source  velocities  exceeding  10°/s,  the  horizontal  MAMA  is  linearly  related  to  the 
sound  source  velocity  (Chandler  and  Grantham,  1992;  Perrott  and  Musicant,  1977).  This  means 
that  a  certain  minimum  amount  of  sound  source  movement  in  this  velocity  range  is  required  for 
the  listener  to  detect  and  process  changes  in  sound  source  location  (Scharine  et  al.,  2009).  In 
other  words,  the  MAMA  is  a  displacement  threshold,  that  is,  the  minimum  noticeable 
displacement  of  a  sound  source  moving  at  a  constant  velocity.  Note  that  the  MAMA  is  a  product 
of  sound  source  velocity  and  the  duration  of  movement  (stimulus  duration).  According  to 
Chandler  and  Grantham  (1992),  this  minimum  noticeable  angular  displacement  corresponds  to  a 
period  of  observation  (minimum  duration)  varying  from  about  300  ms  (target  at  midline)  to 
about  1200  ms  (target  at  60°),  except  for  very  high  sound  source  velocities  (above  about  100°/s). 
Altman  and  Andreeva  (2004)  reported  a  minimum  duration  of  150-200  ms  in  the  0°-60°  range 
of  observation  angles  and  ~25%-30%  longer  durations  at  larger  angles  for  sound  sources  moving 
at  low  velocities.  The  general  relationship  between  the  MAMA,  sound  source  velocity,  and 
duration of movement is shown in figure 17.

Figure 17. Relationship between the MAMA, sound source velocity,
and  duration  of  movement  (stimulus  duration).  Adapted 
from  Chandler  and  Grantham  (1992). 
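
Because the MAMA is the product of sound source velocity and stimulus duration, any two of these quantities determine the third. The following Python sketch shows the arithmetic; the function names are hypothetical, and the durations it prints are simply implied by the MAMA and velocity pairs quoted above from Grantham (1997), not values reported in that study.

```python
def mama_deg(velocity_deg_per_s, duration_ms):
    """Angular displacement covered at constant velocity during the stimulus."""
    return velocity_deg_per_s * duration_ms / 1000.0


def min_duration_ms(mama, velocity_deg_per_s):
    """Stimulus duration needed to cover a displacement equal to the MAMA."""
    return 1000.0 * mama / velocity_deg_per_s


# MAMA and velocity pairs quoted in the text (Grantham, 1997).
for mama, velocity in [(4.8, 20.0), (7.8, 60.0)]:
    duration = min_duration_ms(mama, velocity)
    print(f"MAMA {mama} deg at {velocity} deg/s implies about {duration:.0f} ms")
```

The same relation read in the other direction, for example a 300-ms observation period at 20°/s, corresponds to a displacement of 6°.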

Another  perceptual  characteristic  of  moving  sound  sources  is  the  velocity  threshold  (Grantham, 
1983).  The  velocity  threshold  is  the  minimum  source  velocity  needed  to  detect  sound  source 
movement  in  a  given  constant  period  of  observation.  This  velocity  depends  on  the  duration  of 
the observation period (T) and the sound spectrum/frequency (f). Grantham (1983) observed that
for T = 500 ms, the velocity threshold was 10°-15°/s for sound sources producing a pure tone of
f = 250, 500, or 1000 Hz and about 40°/s for f = 2000 Hz. Carlile and Best (2002) used sound
sources moving at velocities of 15°/s, 30°/s, and 60°/s with no displacement cue (constant
displacement) and reported velocity thresholds of 5.5°/s, 9.1°/s, and 14.8°/s, respectively.

When a displacement cue was included, the velocity thresholds dropped to about half of their
previous values. The velocity DL (see footnote 24) is nearly linearly related to the velocity of
the sound source. Altman and Viskov (1977) reported a velocity DL increasing from 10.8°/s to
19.3°/s for sound source velocity increasing from 14°/s to 140°/s. Listeners tend to underestimate
the velocity of sound source motion for short observation periods (30-100 ms), but they are quite
accurate for sounds of longer durations (Perrott et al., 1979).
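
The nearly linear relation between the velocity DL and source velocity can be illustrated with the two data points quoted above from Altman and Viskov (1977). The Python sketch below draws a straight line through those two points and also expresses each DL as a fraction of the reference velocity. The straight-line interpolation and the function names are assumptions made for illustration; the intermediate values it produces are not measured data.

```python
# The two (velocity, DL) points quoted in the text, both in deg/s.
V_LOW, DL_LOW = 14.0, 10.8
V_HIGH, DL_HIGH = 140.0, 19.3


def interpolated_dl(velocity_deg_per_s):
    """Velocity DL on the straight line through the two quoted points."""
    slope = (DL_HIGH - DL_LOW) / (V_HIGH - V_LOW)
    return DL_LOW + slope * (velocity_deg_per_s - V_LOW)


def relative_dl(dl, velocity_deg_per_s):
    """DL expressed as a fraction of the reference velocity."""
    return dl / velocity_deg_per_s


for v in (V_LOW, 77.0, V_HIGH):
    dl = interpolated_dl(v)
    print(f"{v:5.0f} deg/s: DL = {dl:.2f} deg/s, DL/v = {relative_dl(dl, v):.2f}")
```

Although the DL grows with velocity in absolute terms, the relative DL falls sharply over this range, which is one way of reading the reported data.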

Using  continuously  varying  ITDs  and  IIDs,  Blauert  (1972)  and  Grantham  and  Wightman  (1978) 
found  that  the  maximum  rate  at  which  movement  around  a  listener  could  be  continuously 
followed  by  the  listener  is  less  than  2-3  Hz  (720°-1080°/s).  At  rates  of  3-6  Hz,  the  listener 
begins to hear a sound oscillating between the left and right ear (Aschoff, 1962; Blauert, 1972),
and above about 10-20 Hz, no rotational movement is perceived at all, just a constant blur (e.g.,
Aschoff, 1962; Grantham and Wightman, 1979).

Footnote 24. DL is the differential threshold and is also called the just noticeable difference (JND).
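
The equivalence between rotation rate and angular velocity used above (2-3 Hz corresponding to 720°-1080°/s) follows from one full rotation covering 360°. A minimal Python sketch, with a hypothetical function name, makes the conversion explicit:

```python
def rotation_rate_to_deg_per_s(rate_hz):
    """Angular velocity of a source completing rate_hz full circles per second."""
    return 360.0 * rate_hz


for rate in (2.0, 3.0):
    print(f"{rate:.0f} Hz corresponds to {rotation_rate_to_deg_per_s(rate):.0f} deg/s")
```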

In memory-recall studies of moving sound sources, the initial position of a moving target is
usually  displaced  in  the  listener’s  memory  in  the  direction  of  the  target’s  motion  and  this  shift 
needs  to  be  taken  into  account  in  bridging  the  gap  between  perception  and  action  (Hubbard, 
2006). This shift is sometimes called representational momentum and/or the Fröhlich effect
(Fröhlich, 1923; Getzmann, 2005). Displacement of the localized initial position of the moving
sound  source  is  largest  with  the  peripheral  initial  positions  and  decreases  as  the  initial  position 
moves  closer  to  the  median  plane  (Getzmann,  2005).  This  observation  supports  the  notion  that 
spatial  auditory  memory  is  orientation  dependent  (e.g.,  Yamamoto  and  Shelton,  2009).  In  the 
case of a moving continuous noise source, the size of the Fröhlich effect depends on the velocity
(8°/s  vs.  16°/s)  of  the  moving  sound  source  (larger  LE  for  slow  velocities),  although  the  effect  of 
velocity  seems  to  disappear  for  pulsed  noise  sources  (Getzmann,  Lewald,  and  Guski,  2004). 

In most studies of moving sound sources, the sound source moves along a circular path
around the listener. For this type of sound source movement, the sound frequency/spectrum and
the sound intensity at the listener’s location are independent of sound source position, and the
MAMA is primarily dependent on binaural cues (e.g., Dong et al., 2000). For linearly moving
sound sources, the movement of the source passing the listener produces a change in frequency
and changes in sound pressure level due to the changing distance between the sound source and
the listener (Lee and Wang, 2009; Lutfi and Wang, 1994; Rosenblum et al., 1987). Lutfi and
Wang (1999) and Kaczmarek (2005) reported that the velocity DL for a sound source moving
along  a  linear  trajectory  is  relatively  independent  of  both  the  initial  velocity  (10-50  m/s  range) 
and  the  initial  position  of  the  sound  source  in  space.  At  low  initial  velocities  of  about  10  m/s, 
changes  in  the  position  of  a  sound  source  moving  at  constant  velocity  are  determined  on  the  basis 
of  interaural  differences  (IIDs  and  ITDs),  and  changes  in  its  velocity  are  determined  on  the  basis 
of  the  Doppler  effect.  At  high  velocities  (about  50  m/s),  the  Doppler  effect  is  the  main  cue  for 
all discrimination tasks (Lutfi and Wang, 1999). However, the average velocity DL varies
broadly  across  listeners,  e.g.,  from  1.5  to  4.6  m/s  (Kaczmarek,  2005).  The  results  of  all  these 
studies  suggest  that  a  listener’s  perception  of  the  motion  of  a  moving  sound  source  depends  more 
on  the  changes  in  sound  frequency  and  intensity  than  on  binaural  localization  cues. 

The  MAMA  in  the  median  plane  was  initially  measured  by  Saberi  and  Perrott  (1990)  at  the  0° 
elevation. They found that it is a U-shaped function of velocity with a minimum at 7°-11°/s.
Under these optimal velocity conditions, the MAMA is about 6°. Differential thresholds (DLs)
in the median plane were measured by Agaeva (2004) at vertical velocities of 58°/s and 115°/s. She
reported that the DL values were dependent on the type of movement (stepped vs. continuous),
the sound spectrum (higher DLs for low frequency noises), and the sound source velocity (larger
DLs for 115°/s).

Saberi and Perrott (1990) studied the MAMA for sound sources moving in diagonal and oblique
planes.  Similarly  to  the  MAA  measured  in  the  same  planes,  the  MAMA  for  the  45°  plane  was 
practically  the  same  as  for  the  0°  plane.  Furthermore,  the  MAMAs  measured  for  the  80°  and  87° 
planes  were  still  substantially  smaller  than  the  MAMAs  measured  for  the  median  (90°)  plane. 

There  are  two  theories  of  motion  perception:  the  snapshot  theory,  according  to  which  the  listener 
compares  the  initial  and  final  angles  of  a  sound  source’s  position  to  evaluate  its  potential  motion, 
and  the  continuous  motion  theory,  according  to  which  a  listener  actually  monitors  the  motion  of 
the  sound  source  (Perrott  and  Marlborough,  1989).  An  argument  for  the  snapshot  theory  is  that  a 
sound  source  does  not  need  to  actually  move  to  create  the  sensation  of  movement.  The  proper 
timing  of  two  acoustic  stimuli  produced  from  two  separate  sound  sources  can  produce  the 
sensation of sound source motion called auditory apparent motion (AAM) (Strybel et al., 1998).
Stimulus  timing  is  determined  from  the  durations  of  both  stimuli  and  the  difference  in  their  onset 
times,  called  stimulus  onset  asynchrony  (SOA).  The  spatial  separation  between  the  two  sound 
sources  does  not  affect  the  strength  of  the  AAM  sensation  and  only  affects  the  perceived  velocity 
of motion (Perrott and Strybel, 1977; Strybel et al., 1998). For example, Strybel et al. (1990)
reported  that  two  sound  bursts  with  durations  of  50  ms  and  a  SOA  of  40-60  ms  can  produce  an 
AAM  with  sound  sources  separated  by  as  little  as  6°  or  as  much  as  160°.  Similarly,  Bremer  et  al. 
(1977),  Hari  (1995),  and  Shore  et  al.  (1998)  observed  that  a  click  train  presented  successively  in 
two  spatially  separated  locations  is  perceived  by  listeners  as  smoothly  moving  from  one  location 
to the other (see footnote 25). However, these effects seem to be observable only for a limited range of
perceived  velocities  and  interstimulus  intervals.  Grantham  (1997)  compared  the  perceptual 
effects  of  a  sound  source  moving  between  points  A  and  B  and  the  same  sound  source  appearing 
at  point  A  and  after  a  corresponding  delay  at  point  B.  He  observed  that  at  a  velocity  of  20°/s, 
listeners  could  differentiate  both  conditions  and  inferred  that  the  snapshot  theory  was  not 
adequate  to  explain  listeners’  performance.  He  concluded  that  “if  there  is  a  specialized 
mechanism  in  the  auditory  system  sensitive  to  horizontal  motion,  it  apparently  operates  only  in  a 
restricted  range  of  velocities”  (Grantham,  1997,  p.  295).  It  is,  therefore,  very  likely  that  both  of 
the  proposed  mechanisms  of  motion  perception  may  exist,  but  that  they  operate  in  different 
velocity  ranges. 

From  the  practical  standpoint,  an  important  question  is  whether  a  person  hearing  a  moving  sound 
source  can  determine  the  distance  to  the  source.  Several  studies  have  addressed  this  issue  (e.g., 
Rosenblum  et  al.,  1993;  Schiff  and  Oldak,  1990),  but  since  this  is  actually  a  distance  perception 
question,  it  will  only  be  mentioned  here.  According  to  Caelli  and  Porter  (1980),  in  real-life 
situations  people  overestimate  the  distance  to  an  approaching  sound  source  by  approximately  a 
factor  of  two.  In  their  study,  listeners  did  not  react  to  the  sound  of  an  ambulance  siren  until  the 
ambulance was less than 100 m away. This may partially be explained by the loudness
constancy hypothesis, according to which people do not pay attention to changes in loudness that
result from a change in the distance between them and the sound source (e.g., Zahorik and
Wightman, 2001).

Footnote 25. This effect is called auditory saltation.


Sound  source  localization  is  not  only  dependent  on  the  position  and  movement  of  the  sound 
source  itself  but  also  on  the  movement  of  the  listener.  The  results  of  several  studies  indicate  that 
slow,  passive  whole-body  rotations  improve  rather  than  degrade  localization  accuracy  (Perrett 
and  Noble,  1997b;  Thurlow  and  Runge,  1967).  This  finding  is  in  agreement  with  the  notion 
discussed  in  section  2  that  small  head  movements  of  the  listener  improve  localization  accuracy. 
In  contrast,  transcutaneous  vibrations  applied  to  the  posterior  neck  muscles  cause  systematic 
error (CE) toward the side of the vibrations (Lewald et al., 1999). Similarly, fast and extensive
whole-body rotations lead to large CEs in the direction opposite to the direction of rotation (e.g.,
Jongkees and Van der Veer, 1958; Lester and Morant, 1970; Pierce, 1901). However,
immediately after the termination of rotation, this systematic shift in perceived location changes
to be in the direction of the former movement (e.g., Münsterberg and Pierce, 1894). Both these
types  of  changes  are  analogous  to  the  auditory  motion  aftereffect  mentioned  in  section  5.  They 
also  suggest  “that  vestibular  information  is  taken  into  account  by  the  brain  for  accurate 
localization  of  stationary  sound  sources  during  natural  head  and  body  motion”  (Lewald  and 
Kamath,  2001). 


15.  Summary  and  Conclusions 


The  simple  act  of  auditory  localization  has  been  the  object  of  numerous  studies  that  have 
produced  a  wealth  of  information  about  the  physical,  physiological,  and  psychological  conditions 
that  affect  the  accuracy  and  precision  of  localization.  The  overall  purpose  of  this  report  was  to 
summarize  our  basic  knowledge  about  the  auditory  localization  process  and  discuss  various  types 
of  localization  tasks,  measures  of  localization  accuracy  and  precision,  and  treatments  of  reversal 
errors  in  order  to  facilitate  effective  and  uniform  collection,  processing,  and  interpretation  of 
sound localization data. Both the processing and interpretation of localization data become more
intuitive and simpler when the ±180° scale is used for data representation instead of the 0°-360°
scale, although the 0°-360° scale can also be successfully used with caution. One of the main
problems with analyzing localization data is a lack of clarity regarding various LE metrics. To
guide  in  the  selection  of  appropriate  metrics,  both  linear  and  circular  statistical  analyses  of 
localization  data  were  described,  various  metrics  compared,  and  their  advantages  and  limitations 
stated.  It  has  been  explained  that  the  standard  statistical  measures  for  assessing  constant  and 
random  error  are  not  robust  measures,  as  they  are  quite  susceptible  to  being  overly  influenced  by 
extreme  values  in  the  data  set.  The  robust  measures  discussed  in  this  report  are  intended  to 
provide  researchers  with  alternative  measures  that  may  be  beneficial  for  analyzing  small-sample 
and  unusual  data  distributions.  Another  aspect  of  data  analysis  stressed  in  this  report  was  the 
importance  of  the  separate  processing  of  local  (natural)  localization  errors  and  all  reversal  errors 
(e.g., front-back errors). The sole use of overall LE metrics that combine CE and RE has been
discouraged  as  has  the  uniform  treatment  of  local  errors  together  with  reversal  errors.  Both  these 
practices  can  lead  to  improper  conclusions. 
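
The points about the ±180° scale and about robust measures can be illustrated with a minimal sketch. The function names and the sample responses below are hypothetical and are not taken from the report, and the median absolute deviation is used only as one common example of a robust spread measure, not necessarily the one recommended in the main text. The sketch shows that a linear mean computed on the 0°-360° scale can be misleading for a source located straight ahead, and that a single extreme judgment moves the mean much more than the median.

```python
import statistics


def wrap_to_pm180(angle_deg):
    """Map an angle in degrees onto the interval [-180, 180)."""
    return (angle_deg + 180.0) % 360.0 - 180.0


def median_abs_deviation(values):
    """Median of absolute deviations from the median (a robust spread measure)."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])


# Hypothetical azimuth responses (0-360 scale) to a source straight ahead (0 deg).
responses_360 = [350.0, 355.0, 5.0, 10.0, 2.0]
responses_pm180 = [wrap_to_pm180(a) for a in responses_360]

mean_360 = statistics.mean(responses_360)      # far from the source direction
mean_pm180 = statistics.mean(responses_pm180)  # close to the source direction

# One extreme judgment shifts the mean much more than the median.
with_outlier = responses_pm180 + [60.0]
mean_out = statistics.mean(with_outlier)
median_out = statistics.median(with_outlier)
spread_out = median_abs_deviation(with_outlier)

print(mean_360, mean_pm180)
print(mean_out, median_out, spread_out)
```

The wrap is applied before any averaging, so the same linear statistics can then be used on the ±180° data as long as the responses do not straddle the rear (±180°) direction.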

As stated at the beginning of this report, its goal is to be a comprehensive review of auditory
localization  concepts,  metrics,  and  basic  findings.  “Research  on  human  sound  localization  is 
technically  demanding”  (Wightman  and  Kistler,  1993,  p.  174)  and  a  good  understanding  of 
underlying  principles  and  methodologies  is  important  for  designing  studies  that  measure  what 
they  are  supposed  to  measure.  However,  this  report  is  not  intended  to  be  a  detailed  guide  for 
how  to  set  up  and  run  auditory  localization  studies  since  specific  goals  and  technical  constraints 
may  dictate  various  methodological  approaches.  In  this  respect,  the  basic  set  of  rules  formulated 
at the beginning of the 20th century by Angell (1903) still seems to provide valid, initial guidance:

1.  A  variety  of  different  sound  sources  (sounds)  should  be  used. 

2.  Sound  sources  should  produce  sounds  of  controllable  intensity. 

3.  The  listener  should  not  know  the  actual  locations  of  the  sound  sources. 

4.  All  sound  sources  should  be  placed  at  equal  distances  from  the  listener. 

5.  There  should  be  absolutely  no  reflected  sounds  arriving  at  the  listener. 26 

6.  The  listeners  should  have  symmetrical  hearing. 

Decisions  such  as  the  number  of  listeners/judgments,  the  number  of  reference  directions,  the  type 
of  sound  sources  (sounds),  and  listener  instructions  can  vary  enormously  across  studies, 
depending  on  their  specific  goals.  Even  categorical  localization  studies,  which  are  discouraged 
for  use  in  studying  localization  phenomena,  can  sometimes  be  appropriate  when  applied  to 
comparative  assessments  of  equipment  or  combined  with  directional  speech  recognition  tasks. 
The  crux  of  the  matter  is  that  in  such  cases  researchers  should  select  a  sound  source  distribution 
and  formulate  the  research  question  in  such  a  way  that  categorical  data  may  be  easily  converted 
into  absolute  localization  data,  if  needed. 

Our  intent  was  to  provide  a  stable  terminological  base;  outline  the  judgment  and  metrics  options; 
discuss  applied  spatial  perception  research  topics  (directional  audiometry  and  the  localization  of 
multiple  and  moving  sound  sources);  and  provide  estimates  regarding  expected  data.  Although  a 
lot  is  known  about  the  human  ability  to  localize  sound  sources  producing  single,  stationary 
signals,  researchers  are  just  beginning  to  explore  spatially  divided  attention,  spatial  memory,  the 
perception of dynamically changing spatial signals, and serial localization judgments. The
provided information is intended to guide researchers in selecting both the research issues and
the analytical tools to use in documenting the investigated issues.

26 Gerzon (1971; 1974) observed that in the case of multichannel stereo recordings, the addition of a moderate level of
uniformly distributed reverberation energy to the recording may sometimes aid in the localization of the recorded sound sources.
This may be due to the masking effect of the reverberation energy over some low-level discrete reflections present in the listening
space.

The  only  two  strictly  methodological  issues  addressed  in  this  report  are  the  selection  of  a 
direction  pointing  technique  and  the  learning/practice  effect  in  auditory  localization.  The 
preferred type of directional response and listener learning/practice effects are the two most
debated  elements  of  localization  study  methodology,  and  therefore,  we  felt  compelled  to  provide 
the  reader  with  background  information  to  help  them  to  make  informed  decisions  in  designing 
their  studies.  However,  both  of  these  topics  are  addressed  outside  of  the  main  body  of  the  report 
as  appendices  A  and  B,  respectively. 

Appendix A. Direction Pointing


Beyond  the  nature  of  the  sound  sources  (type,  number,  visibility),  environment  (space  geometry, 
reflections,  atmospheric  conditions,  etc.),  and  listeners  themselves  (type,  number),  another 
important  factor  affecting  the  properties  and  extent  of  LE  is  the  methodology  used  for  data 
collection.  For  example,  in  3-D  auditory  localization,  sounds  can  be  presented  by  a  fixed  array 
of  loudspeakers  (e.g.,  Gilkey  et  al.  1995),  an  arc  of  loudspeakers  that  can  be  rotated  around  a 
fixed  axis  (either  vertically  or  horizontally)  (e.g.,  Makous  and  Middlebrooks,  1990;  Wightman 
and  Kistler,  1989b),  loudspeakers  mounted  on  rotating  booms  (e.g.,  Oldfield  and  Parker,  1984a; 
Otten,  2001),  or  as  phantom  sources  in  a  3-D  virtual  space  presented  through  earphones  (e.g., 
Vermiglio et al., 1998). (See also the discussion of this topic in section 12.) Other
methodological decisions are related to the presence and type of background noise and
distractors, listener instructions, and the inclusion of dynamic localization cues.

One  of  the  most  debated  procedural  elements  of  absolute  auditory  localization  studies  is  the 
selection  of  the  listener’s  overt  response,  that  is,  the  type  of  direction  pointing.  The  type  of 
direction  pointing  used  in  a  study  is  generally  accepted  as  a  factor  contributing  to  the  magnitude 
of  the  LE,  and  localization  researchers  make  efforts  to  minimize  this  effect  through  listener 
training  and  collecting  supplementary  data  (usually  in  the  visual  domain)  on  the  precision  of  the 
response  mechanism  itself.  Localization  discrimination  and  categorical  localization  studies  are 
not  subject  to  pointing-based  localization  error  since  they  rely  only  on  nominal  or  categorical 
responses.  A  list  of  common  techniques  for  direction  pointing  used  in  absolute  localization 
studies is presented in table A-1. All pointing techniques listed in table A-1 can be generally
classified  as  egocentric  (body-referenced)  or  exocentric  (externally  referenced)  depending  on  the 
selected  point  of  reference  in  making  directional  decisions. 

Table A-1. Main pointing techniques used in auditory localization studies.

| ID | Technique | Publications (Examples) | Comments |
|----|-----------|-------------------------|----------|
| A | Verbal estimation | Cook and Frank (1977); Wightman and Kistler (1989b); Wenzel et al. (1993); Abouchacra et al. (1998a); Vause and Grantham (1999) | Egocentric technique. Verbal estimation can be in degrees or in time expressed on an analog clock scale (e.g., 3:30). |
| B | Head (nose) pointing | Makous and Middlebrooks (1990); Bronkhorst (1995); Carlile et al. (1997); Djelani et al. (2000); Ocklenburg et al. (2010); Majdak et al. (2010) | Egocentric technique. In some studies head tracking systems were used (e.g., Carlile et al., 1997). |
| C | Hand pointing | Thurlow and Runge (1967); Djelani et al. (2000); Zwiers et al. (2001); Majdak et al. (2010) | Egocentric technique. In some studies 3-D infrared or electromagnetic tracking systems were used (e.g., Zwiers et al., 2001). |
| D | Swivel pointer (a) | Ocklenburg et al. (2010); Lewald (1998) | Exocentric technique. |
| E | Gaze direction | Frens and van Opstal (1995); Yao and Peck (1997); Hofman and van Opstal (1998); Hofman et al. (1998) | Egocentric technique. |
| F | Laser (gun) pointing | Oldfield and Parker (1984a); Seeber (1997); Pedersen and Jorgensen (2005); Scharine (2005); Scharine (2009) | Egocentric technique. In some studies wooden or metal pointers were used; Oldfield and Parker (1984a) used a photographic recording system. |
| G | Sphere and stylus | Hartung (1995); Gilkey and Anderson (1995); Gilkey et al. (1995); Good and Gilkey (1996); Otten (2001) | Exocentric technique. God’s Eye View Localization Point (GELP) technique; the size of the pointing error depends on the size of the sphere. |
| H | Tablet and stylus | Hammershoi and Sandvad (1994); Moller et al. (1996); Haferkorn and Schmid (1996) | Exocentric technique. In some studies paper drawings were used (e.g., Haferkorn and Schmid, 1996). |
| I | Loudspeaker on a boom | Sandel et al. (1955) | Egocentric technique. Loudspeaker emitting a reference signal is placed at the angle from which the sound source was perceived. |
| J | Azimuth table | Dobbins and Kindick (1967) | Exocentric technique. Rotating pointer mounted on a table. |
| K | Virtual pointer | Langendijk and Bronkhorst (1997); Langendijk and Bronkhorst (2002a); Pulkki and Hirvonen (2005) | Exocentric technique. Controlled by joystick-like device. |

(a) A mechanical pointer that can rotate around its fixed point of reference (e.g., midpoint).

Another classification of pointing techniques, proposed by Comalli and Altshuler (1980), divides
them into four classes: (1) kinesthetic (e.g., pointing with a laser or by turning the head),
(2) visual (e.g., referring to a map or to numbers located at various positions on a screen covering
the sound sources), (3) auditory (e.g., loudspeaker on a boom), and (4) verbal (e.g., estimating the
angle or quadrant).

An  early  comparison  of  different  direction  indication  techniques  was  performed  by  Bauer  and 
Blackmer  (1965),  who  compared  aiming  (aligning  the  head  and  eyes  with  the  direction  of 
pointing)  with  simple  directional  pointing  and  found  no  difference  in  accuracy.  Wightman  and 
Kistler  (1989b)  compared  verbal  responses  using  degree  and  clock  (e.g.,  7  o’clock)  scales  and 
likewise found no difference between the two methods. Gilkey et al. (1995) reported that the
God’s Eye View Localization Point (GELP) technique (see table A-1), also known as the Bochum
Sphere (Hartung, 1995), was as accurate as verbal indications of direction (MUE <5°) but
less accurate than head (nose) pointing. In addition, Langendijk and Bronkhorst (1997) reported
an advantage for virtual pointer techniques (row K in table A-1) over verbal reporting.
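
To compare verbal responses given on the clock scale with responses given in degrees, the clock readings first have to be converted to angles. The following is a minimal sketch of such a conversion. The function name is hypothetical, and the convention used (12 o'clock is straight ahead at 0°, angles grow clockwise, and the result is reported on the ±180° scale) is an assumption made for this illustration and is not taken from the cited studies.

```python
def clock_to_azimuth(hours, minutes=0):
    """Convert an analog-clock direction to azimuth in degrees on [-180, 180).

    Assumed convention: 12 o'clock is straight ahead (0 deg) and angles grow
    clockwise, so 3 o'clock is +90 deg (right) and 9 o'clock is -90 deg (left).
    """
    # The hour hand covers 30 deg per hour and 0.5 deg per minute.
    angle = (hours % 12) * 30.0 + minutes * 0.5
    # Wrap from the 0-360 scale onto the +/-180 scale.
    return (angle + 180.0) % 360.0 - 180.0


print(clock_to_azimuth(3, 30))  # halfway between 3 and 4 o'clock
print(clock_to_azimuth(7))      # rear-left quadrant
```

Under this convention one hour on the clock scale corresponds to 30°, so clock responses given only to the nearest hour are much coarser than responses given in degrees.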

Carlile  et  al.  (1997)  compared  several  pointing  techniques  and  concluded  that  head  (nose) 
pointing  was  more  accurate  than  verbal  estimates  or  the  use  of  a  stylus  with  either  a  sphere  or  a 
tablet.  Majdak  et  al.  (2010)  compared  hand  and  head  (nose)  pointing  and  found  similar 
localization  performance  for  both  methods  for  horizontal  as  well  as  vertical  localization  tasks. 
Razavi (2009) compared gaze (eye and head), head, and eye pointing and found gaze pointing to
be  more  accurate  than  either  head  or  eye  pointing  alone  (p.  vi).  However,  the  CE  associated  with 
these  pointing  techniques  seems  to  additionally  depend  on  the  handedness  of  the  listener. 
Ocklenburg  et  al.  (2010)  compared  the  localization  accuracy  of  left-  and  right-handed  listeners 
with  the  use  of  head  and  hand  pointing  and  found  that  listeners  demonstrated  a  bias  toward  their 
non-preferred  side  with  both  pointing  methods. 

However,  it  needs  to  be  stressed  that  auditory  localization  accuracy  in  both  the  horizontal  and 
vertical directions is affected by eye position regardless of the pointing method. For example,
Weerts  and  Thurlow  (1971),  Hartmann  (1983b),  and  Kopinska  and  Harris  (2003)  observed  a 
gaze-related  CE  of  2°-3°  toward  the  direction  of  gaze.  Some  other  authors  have  reported  shifts 
of  similar  magnitude  but  either  in  the  opposite  direction  (Lewald,  1998)  or  inconsistently  in  both 
directions  (Lewald,  1997;  Razavi,  2009).  Getzmann  (2002)  studied  the  effect  of  gaze  direction 
on  localization  in  the  median  plane  and  reported  an  average  shift  of  8.6°  toward  the  direction  of 
eccentric  gaze.  All  these  reports  indicate  that  eye  position  affects  the  perceived  location  of  the 
sound  source  and  that  this  effect  may  be  different  depending  on  the  experimental  conditions.  It 
may also be time-dependent (Razavi, 2009; Razavi et al., 2007). It is, therefore, important to
control  for  eye  position  in  studies  of  auditory  spatial  perception  that  are  not  based  on  gaze 
pointing  (Cui  et  al.,  2010).  It  is  also  important  to  realize  that  head-pointing  may  lead  to 
erroneous  results  if  long  sounds  are  used  in  an  azimuth  localization  task  at  elevations  other  than 
0°.  Head-pointing  in  the  vertical  direction  during  the  listening  task  changes  the  listener’s 
listening plane. With a tilted head, the listener is pointing in an oblique plane that constitutes a
new  “horizontal”  plane  for  the  listener.  This  may  be  a  different  task  than  is  actually  intended. 

Several  authors  have  also  pointed  out  that  localization  performance  can  be  affected  by  memory. 
For  example,  in  head  or  laser  pointing,  the  listener  first  determines  the  location  of  the  sound 
source  and  then  turns  around  to  indicate  the  remembered  position.  However,  Makous  and 
Middlebrooks  (1990)  argued  that  the  response  technique  appears  to  have  only  a  negligible  effect 
on  localization  performance. 

In  summary,  on  the  basis  of  the  conducted  comparisons  and  meta-analyses  of  localization  studies 
(e.g., Djelani et al., 2000), it can be concluded that egocentric systems (pointing toward the
sound  source  or  verbally  indicating  its  position)  are  generally  more  precise  than  exocentric 
systems  (using  a  display  screen,  drawings  on  paper,  a  response  sphere,  etc.),  especially  for 
listeners  with  no  or  minimal  experience  in  using  the  specific  pointing  system.  The  most  precise 
technique  seems  to  be  the  laser  pointing  technique.  Seeber  (1997),  for  example,  reported  errors 
on  the  order  of  only  0.2°  for  laser  pointing,  which  seem  to  be  an  order  of  magnitude  smaller  than 
the  errors  reported  for  other  methods.  It  seems  that  the  laser  beam  provides  important  visual 
feedback to the listener leading to more accurate sound source localization (Razavi, 2009,
p. 216).

Appendix B. Localization Training


Performance  in  perceptual  tasks  improves  with  practice,  and  this  process  is  called  perceptual 
learning.  If  this  process  is  structured  by  providing  some  form  of  feedback  or  adapting 
instructions,  it  is  frequently  referred  to  as  perceptual  training,  behavioral  training,  or  perceptual 
skills  development.  Most  studies  of  auditory  learning/training  have  demonstrated  a  high 
plasticity  of  the  human  auditory  system  in  performing  a  variety  of  discrimination  tasks  (Fahle 
and Poggio, 2002; Habib and Besson, 2009; Polley et al., 2006). The maximal sensitivity to
sensory exposure exists during the early postnatal developmental period and gradually decreases
as the brain matures. However, certain internal rewiring of brain regions can be seen to
occur even in older people (e.g., Spolidoro et al., 2009). A general discussion of spatial
adaptation  can  be  found  in  Welch  (1986). 

One  aspect  of  audition  that  might  be  expected  to  be  especially  affected  by  sensory  experience  is 
spatial  perception  (King,  1999).  Sound,  unlike  visual  or  tactile  stimuli,  has  no  specific  location 
(Nudds, 2001; O’Shaughnessy, 2002, p. 446). Therefore, the brain has to determine the
location of the sound source on the basis of a variety of localization cues. Such a situation lends
itself to gradual improvements in sound processing by the brain, resulting in improved auditory
spatial perception. However, the data provided by psychoacoustic studies to date do not present a
clear  picture  of  how  repeated  exposure  to  the  same  set  of  spatial  situations  affects  a  listener’s 
general  ability  to  localize  sound  sources. 

The  learning  of  auditory  localization  skills  may  be  considered  as  the  effect  of  practice  (repetition 
without  feedback)  or  training27  (practice  with  feedback),  and  may  involve  natural  or  altered 
localization  cues.  Natural  localization  cues  are  the  cues  that  a  person  has  already  been  using, 
while  altered  cues  arise  when  natural  cues  change  due  to  asymmetrical  hearing  loss,  pinna 
modification, the use of a single hearing aid, etc.

The  data  reported  in  the  literature  regarding  the  effect  of  practice  (no  feedback)  on  absolute 
auditory  localization  with  natural  localization  cues  are  contradictory.  Several  authors  have 
reported no or insignificant practice effect (e.g., Davis and Stephens, 1974; Carlile et al., 1997;
Giguere and Abel, 1993; Hartmann, 1983a; Russell, 1976; Savel, 2009; Zwiers et al., 2001;
Zahorik et al., 2001; Zahorik et al., 2006). This finding seems to be independent of whether the
listeners  have  or  have  not  had  previous  training  (e.g.,  Wersenyi,  2009;  Zahorik  et  al.,  2001). 
However,  there  are  also  some  reports  indicating  that  simple  practice  may  have  an  effect  on 
localization  performance.  For  example,  Jacobsen  (1976)  reported  that  the  MAA  threshold 
gradually improved from 1.7° at the beginning of data collection (first eight series) to 0.75° at
the end of data collection (last eight series). Wright and Fitzgerald (2001) and Abel and Paik
(2004) observed improved absolute localization performance with practice for high frequency
sound sources (4 kHz; IID cues) but not for low frequency sources (500 Hz; ITD cues). Minnaar
et al. (2001) studied localization accuracy with binaural recordings made with several artificial
heads and reported continuous improvement over a period of five days. Honda et al. (2007)
reported that two weeks of game playing in an AVR substantially improved the players’ AVR
localization accuracy in both the horizontal and vertical planes. However, for studies
demonstrating the effectiveness of practice on localization performance in AVR environments,
the question arises as to what extent the observed improvements are the results of the practice
itself and to what extent they are the effects of procedural learning and adaptation to the AVR
environment (Hawkey et al., 2004). In addition, an increase in localization performance due to
game playing cannot be considered a simple practice effect since game progress provides
natural feedback to the player. Similarly, in some longer lasting practice studies, the listeners
may receive unintentional behavioral or ecological feedback and learn where the actual sound
sources are physically located.

27 Unfortunately, the terms practice and training are frequently used in the literature interchangeably, and practice without
feedback or some form of guiding instructions is also frequently described as training.

In  contrast  to  the  unclear  effect  of  simple  practice,  the  majority  of  literature  reports  are  in 
agreement  that  providing  feedback-based  or  multimodal  localization  training  prior  to  the 
auditory  localization  study  is  effective  in  reducing  front-back  errors  and  improving  overall 
localization  performance  (e.g.,  Makous  and  Middlebrooks,  1990;  Martin  et  al.,  2001;  Park,  1996; 
Pearce,  1937;  Perrott  et  al.,  1969;  Wright  and  Zhang,  2006;  2009;  Zahorik  et  al.,  2006).  For 
example,  Zahorik  et  al.  (2001)  reported  that  visual  feedback  training  markedly  improved  the 
localization  accuracy  of  their  listeners,  with  the  improvement  appearing  to  last  for  several  days. 
Majdak  et  al.  (2010)  observed  a  large  training  effect  (with  feedback)  for  about  the  first  400  trials 
(3-4  h)  of  a  sound  localization  task  and  a  smaller  improvement  beyond  those  400  trials.  The 
accuracy  and  precision  of  the  judgments  increased,  and  the  number  of  front-back  errors 
decreased.  In  contrast,  Terhune  (1985)  reported  no  benefit  with  feedback-supported  short-term 
practice  (50  trials). 

Despite  several  accounts  of  effective  adaptation  to  new  sets  of  localization  cues,  the  overall 
results  of  the  reviewed  studies  lead  to  the  conclusion  that  although  adaptation  to  new  localization 
cues  is  generally  fully  successful  in  the  median  plane,  it  is  frequently  only  partially  successful  in 
the  horizontal  plane  (e.g.,  Javer  and  Schwartz,  1995;  Shinn-Cunningham  et  al.,  1998a;  Wright 
and  Zhang,  2006).  The  complete  or  partial  adaptation  or  re-adaptation  process  is  asymptotic  and 
has  been  reported  to  take  about  7-14  days  (e.g.,  McPartland  et  al.,  1997;  Van  Wanrooij  and  Van 
Opstal,  2005),  although  some  adaptation  can  already  be  observed  within  1-2  h  (e.g.,  Wright  and 
Zhang,  2006).  In  contrast,  other  authors  did  not  observe  any  adaptation  effects  in  localization 
performance  after  24  h  (Slattery  and  Middlebrooks,  1994)  or  several  days  (McPartland  et  al., 
1997)  of  continuous  use  of  a  unilateral  earplug.  In  general,  training  is  most  effective  if  repeated 
every day, and a single-day training session has never been shown to have a lasting effect.

It is important to stress that many authors have reported very large individual differences in
localization performance among listeners (e.g., Javer and Schwartz, 1995; Langford, 1994;
Shinn-Cunningham et al., 1998a; Wenzel et al., 1993; Wright and Fitzgerald, 2001), leading to
the  concepts  of  good  localizer  and  poor  localizer.  Some  authors  attribute  this  ability  to  specific 
anatomical  differences  in  the  shape  and  size  of  the  head,  pinna,  and  concha  (e.g.,  Middlebrooks 
and  Green,  1991;  Wightman  and  Kistler,  1989b).  Saberi  and  Antonio  (2003)  further  noticed  that 
poor  localizers  have  a  tendency  to  improve  their  localization  performance  with  training  while 
good  localizers  do  not.  These  results  seem  to  suggest  that  the  difference  between  good  and  poor 
localizers  is  not  only  physiological  but  may  also  result  from  previous  exposure  to  the  variety  of 
spatial  environments  and  from  lifestyle. 

Although  there  is  a  lack  of  unequivocal  evidence  that  people  improve  their  localization  abilities 
after  short-term  practice  before  or  during  the  course  of  an  experiment,  there  is  little  doubt  that 
some  long-term  adaptation  (on  the  order  of  days  and  weeks)  takes  place  to  altered  localization 
cues.  Long-term  adaptation  to  new  localization  cues  takes  place  continuously  during  a  child’s 
development as the size of the head gradually increases, but it can also occur in adulthood.
Most people can adapt to unilateral hearing loss (Gardner and Gardner, 1973; Florentine, 1976;
Nabelek et al., 1980) and hearing aids (see Byrne and Dirks [1996] for an overview) and re-learn
to localize sound sources correctly after external ear surgery or other modification to their ears
(e.g., Musicant and Butler, 1980; Butler, 1987; Oldfield and Parker, 1984b; Hofman et al., 1998;
Shinn-Cunningham et al., 1998a). This adaptation to new cues seems to also apply to
preprocessed cues that simulate larger-than-normal head size and make better-than-normal
spatial resolution possible, which is of special interest to military researchers (e.g.,
Shinn-Cunningham and Durlach, 1994).

Shinn-Cunningham et al. (1998a, b) studied the effect of synthesized supernormal localization
cues  on  spatial  perception.  While  supernormal  localization  cues  can  improve  localization 
discrimination  (see  section  8),  they  cause  a  shift  in  the  apparent  location  of  the  sound  source 
simulated  by  the  cues  and  may  worsen  the  accuracy  of  absolute  localization.  The  authors 
concluded  that  training  reduced  the  size  of  absolute  CE  but  also  that  the  listeners  never 
completely  adapted  to  the  new  set  of  cues.  Such  incomplete  adaptation  is  consistent  with 
previous  reports  (e.g.,  Welch,  1986).  Another  observation  made  by  the  authors  was  that  the 
listeners  were  “able  to  accommodate  only  linear  transformations  of  cues,  rather  than  being  able 
to adapt to arbitrary complex remappings” (Shinn-Cunningham et al., 1998b, p. 3675).

Three  additional  comments  need  to  be  made  with  respect  to  adapting  to  altered  localization  cues: 

1.  It  seems  that  the  adaptation  process  cannot  be  generalized  to  stimuli  that  are  very  different 
from  those  used  in  the  practice/training  (Feinstein,  1973;  Butler,  1987). 

2.  Adaptation  seems  to  be  asymmetrical  and  is  greater  in  the  left  than  the  right  hemifield 
(Wells  and  Ross,  1980;  Shinn-Cunningham  et  al.,  1998a;  Savel,  2009). 

3. The available data (e.g., Kumpik et al., 2010; Nabelek et al., 1980) indicate that people can
quickly re-learn natural localization cues after the cause of the altered cues has been
removed,  indicating  that  the  natural  neural  traces  in  the  brain  are  not  significantly  altered 
by  learning  the  new  cues. 

In  addition,  it  is  noteworthy  that  people  are  completely  unable  to  adapt  to  the  reversal  of  left  and 
right ear cues (Young, 1928; Hofman et al., 2002).

The  same  capacity  for  plasticity  in  auditory  localization  described  earlier  for  adult  humans  has 
also been reported for other animals (e.g., Knudsen et al., 1984; Knudsen, 1984; 1985; King et
al., 2011). In addition, both human and animal studies indicate that brain wiring cannot be
changed  without  previous  normal  binaural  experience  (e.g.,  Knudsen,  1985;  King  and  Carlile, 
1993).  For  example,  Wilmington  et  al.  (1994)  reported  that  the  surgical  correction  of  congenital 
unilateral  hearing  loss  did  not  restore  normal  binaural  hearing.  Even  a  long  time  after  the 
surgery,  the  spatial  auditory  capabilities  that  require  the  integration  of  basic  binaural  cues  had 
not  been  restored.  Together  these  findings  support  the  notion  that  the  neural  mechanisms 
underlying  auditory  spatial  perception  are  dependent  on  initial  auditory  exposure  for  proper 
development  (Mrsic-Flogel  et  al.,  2001).  In  addition,  animal  studies  indicate  that  the  duration  of 
the  after-effect  resulting  from  the  removal  of  a  monaural  earplug  seems  to  be  species  dependent 
(King  et  al.,  2011). 


One  difficulty  with  comparing  the  effects  of  practice  and  training  on  localization  performance 
reported  in  various  studies  is  that  most  reports  provide  qualitative  or  raw  quantitative  data 
without  any  formal  data  analysis.  In  addition,  these  effects  are  normally  discussed  for  overall  LE 
without  separate  considerations  for  CE  and  RE.  A  simple  method  of  determining  the  effect  of 
training  on  the  size  of  RE  is  to  use  a  variant  of  Fisher’s  F-test  (variance  ratio  test)  (Fisher,  1920), 
that is, by calculating the ratio of the data variances before and after training

\[
F = \frac{V_{\mathrm{prior}}}{V_{\mathrm{post}}} = \frac{SD_{\mathrm{prior}}^{2}}{SD_{\mathrm{post}}^{2}} \tag{49}
\]

where $V_{\mathrm{prior}}$, $V_{\mathrm{post}}$, $SD_{\mathrm{prior}}$, and $SD_{\mathrm{post}}$ are, respectively, the variances and standard deviations of the
data collected in the localization test before and after training. Alternatively, any other similar
test of equality for two variances can be used (see any standard statistical software package or
textbook).
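
As a minimal sketch of the variance ratio described above, the following example computes the ratio of the sample variances of signed localization errors collected before and after training. The function name and the data are hypothetical and serve only as an illustration. Judging whether the ratio is statistically significant additionally requires comparing it with the critical value of the F distribution for the corresponding degrees of freedom (number of judgments minus one in each data set), which is not done here.

```python
import statistics


def variance_ratio(prior_errors, post_errors):
    """Ratio of sample variances (prior / post) of signed localization errors."""
    v_prior = statistics.variance(prior_errors)  # sample variance, n - 1 denominator
    v_post = statistics.variance(post_errors)
    return v_prior / v_post


# Hypothetical signed errors (degrees) before and after training.
prior = [-12.0, 8.0, -5.0, 15.0, -9.0, 3.0]
post = [-4.0, 3.0, -2.0, 5.0, -3.0, 1.0]

f_ratio = variance_ratio(prior, post)
print(f_ratio)  # a ratio above 1 means a smaller spread of judgments after training
```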

A  convenient  measure  of  the  effect  of  practice  or  short-term  training  on  CE  is  Cohen’s  d  defined 
as 


\[
d = \frac{\bar{X}_{\mathrm{prior}} - \bar{X}_{\mathrm{post}}}{SD} \tag{50}
\]

where $\bar{X}_{\mathrm{prior}}$ and $\bar{X}_{\mathrm{post}}$ are the arithmetic means of the judgments made prior to and after training
and $SD$ is the pooled standard deviation calculated from $SD_{\mathrm{prior}}$, $SD_{\mathrm{post}}$, and $n$,
where $SD_{\mathrm{prior}}$ and $SD_{\mathrm{post}}$ have the same meaning as in equation 49 and $n$ is the number of
judgments made by the listener. Cohen’s d is a measure of effect size, and by convention, an
effect size of ±0.2 is small, ±0.5 is moderate, and ±0.8 or greater is large (Cohen, 1988; 1992).
Note that Cohen’s d may be larger than 1. A good tutorial on the use of various measures of
effect size is provided by Thalheimer and Cook (2002).
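
The calculation of Cohen’s d can be sketched as follows. The function name and the data are hypothetical. The pooling rule used here, the square root of the mean of the two variances, is the common textbook form for two data sets with the same number of judgments; it is an assumption of this sketch and may differ from the exact expression intended in this report.

```python
import math
import statistics


def cohens_d(prior, post):
    """Cohen's d for judgments made prior to and after training (equal n assumed)."""
    if len(prior) != len(post):
        raise ValueError("this sketch assumes the same number of judgments in both sets")
    mean_prior = statistics.mean(prior)
    mean_post = statistics.mean(post)
    sd_prior = statistics.stdev(prior)  # sample SD, n - 1 denominator
    sd_post = statistics.stdev(post)
    # Assumed pooling rule for equal sample sizes: root of the mean variance.
    sd_pooled = math.sqrt((sd_prior ** 2 + sd_post ** 2) / 2.0)
    return (mean_prior - mean_post) / sd_pooled


# Hypothetical localization errors (degrees) before and after training.
prior = [6.0, 8.0, 10.0, 12.0, 14.0]
post = [2.0, 4.0, 6.0, 8.0, 10.0]

d = cohens_d(prior, post)
print(d)  # exceeds 0.8, so this would count as a large effect by the convention above
```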

A  good  summary  of  the  effects  of  practice,  training,  and  adaptation  on  sound  source  localization 
is  available  in  Wright  and  Zhang  (2006).  They  concluded  that  although  human  adaptation  to 
altered  sound  localization  cues,  either  complete  or  partial,  has  been  well  established,  the 
evidence  of  a  practice  effect  is  unclear. 

Finally,  Durlach  and  Pang  (1986)  and  Rabinowitz  et  al.  (1993)  showed  that  the  proper  frequency 
scaling  of  an  individual’s  HRTF  (and  the  distance  to  the  sound  source)  can  produce  HRTFs  for  a 
similar  but  larger  head  size  and  result  in  improved  localization  resolution.  Another  type  of  HRTF 
manipulation  that  preserves  the  same  ITDs  and  IIDs  but  reassigns  them  to  different  angles  of 
sound  arrival  was  described  by  Durlach  et  al.  (1993).  Such  transformation  can  increase  spatial 
resolution in the frontal direction but decrease it along the interaural axis.




List  of  Symbols,  Abbreviations,  and  Acronyms 


2-D 

two-dimensional 

3-D 

three-dimensional 

AAM 

apparent  auditory  motion 

AES 

anterior  ectosylvian  sulcus 

ALF

Auditory Localization Facility

AMA 

auditory  motion  aftereffect 

AN 

auditory  nerve 

ANSI 

American  National  Standards  Institute 

APA 

American Psychological Association

ARL

U.S. Army Research Laboratory

ASA 

auditory  scene  analysis 

AVR 

auditory  virtual  reality 

BF

back-front 

CC 

corpus  callosum 

CE 

constant  error 

CMAA 

concurrent  minimum  audible  angle 

CMR 

Coordinated  Measure  Response 

CN 

cochlear  nucleus 

CNS 

central  nervous  system 

CRT 

Choice  Reaction  Time 

DCN 

dorsal  cochlear  nucleus 

DDT 

directional  detection  threshold 

DL

difference  limen 

DOA 

direction  of  arrival 
EE

excitatory-excitatory 

EI

excitatory-inhibitory 

FB

front-back 

GELP

God’s Eye Localization Pointing

GoF

goodness-of-fit 

HMR 

head  movement  response 

HRTF

head-related  transfer  function 

IC 

inferior  colliculus 

IED

interaural  envelope  difference 

IID 

interaural  intensity  difference 

IPD 

interaural  phase  difference 

ISD 

interaural  spectrum  difference 

ISI 

Interstimulus  Interval 

ITD 

interaural  time  difference 

KEMAR 

Knowles  Electronic  Manikin  for  Acoustic  Research 

LE

localization  error 

LGoF

localization  goodness  of  fit 

LL

lateral  lemniscus 

LSO

lateral  superior  olivary 

MAA 

minimum  audible  angle 

MAD 

mean  absolute  deviation 

MAMA 

minimum audible movement angle

MD 

median 

ME 

mean  (signed)  error 

MEAD 

median  absolute  deviation 

MGB 

medial  geniculate  body 

MSO 

medial  superior  olivary 
MUE

mean unsigned (absolute) error

RE

random error

RMSE

root-mean-squared error

S

skew

SAINT

Source Azimuth Identification in Noise Test

SC

superior colliculus

SD

standard deviation

SE

standard error

SEK

standard error of kurtosis

SES

standard error of skew

SF

Fisher’s skew

SL

sensation level

SNR

signal-to-noise ratio

SOA

stimulus onset asynchrony

SOC

superior olivary complex

SP

Pearson’s skew

SRT

simple reaction time

TB

trapezoid body

TCAPSs

tactical communication and protection systems

VCN

ventral cochlear nucleus

VE

ventriloquism effect

WWM

Wheeler-Watson-Mardia